Mathematical Statistics
Published by 

Publication Committee 




The Carus Mathematical Monographs are an expression of the desire of Mrs. Mary Hegeler Carus, and of 
her son, Dr. Edward H. Carus, to contribute to the dissemina- 
tion of mathematical knowledge by making accessible at nominal cost 
a series of expository presentations of the best thoughts and keenest 
researches in pure and applied mathematics. The publication of the 
first four of these monographs was made possible by a notable gift to 
the Mathematical Association of America by Mrs. Carus as sole 
trustee of the Edward C. Hegeler Trust Fund. The sales from these 
have resulted in the Carus Monograph Fund, and the Mathematical 
Association has used this as a revolving book fund to publish the 
fifth and sixth monographs. 

The expositions of mathematical subjects which the monographs 
contain are set forth in a manner comprehensible not only to teachers 
and students specializing in mathematics, but also to scientific 
workers in other fields, and especially to the wide circle of thoughtful 
people who, having a moderate acquaintance with elementary mathe- 
matics, wish to extend their knowledge without prolonged and critical 
study of the mathematical journals and treatises. The scope of this 
series includes also historical and biographical monographs. 

The following books in this series have been published to date: 

No. 1. Calculus of Variations, by Gilbert Ames Bliss. 

No. 2. Analytic Functions of a Complex Variable, by David Raymond Curtiss.
No. 3. Mathematical Statistics, by Henry Lewis Rietz. 

No. 4. Projective Geometry, by John Wesley Young. 

No. 5. A History of Mathematics in America Before 1900, by David 
Eugene Smith and Jekuthiel Ginsburg. 

No. 6. Fourier Series and Orthogonal Polynomials, by Dunham Jackson.
No. 7. Vectors and Matrices, by C. C. MacDuffee. 

The Carus Mathematical Monographs 




Professor of Mathematics, The University of Iowa 

Published for 






Copyright 1927 By 
The Mathematical Association of America 

Published April 1927 
Second Printing 1929 

Third Printing 1936 
Fourth Printing 1943 

Fifth Printing 1947 

Reprinted by 
John S. Swift Co., Inc. 



This book on mathematical statistics is the third of 
the series of Carus Mathematical Monographs. The 
purpose of the monographs, admirably expressed by 
Professor Bliss in the first book of the series, is "to make 
the essential features of various mathematical theories 
more accessible and attractive to as many persons as 
possible who have an interest in mathematics but who 
may not be specialists in the particular theory presented." 

The problem of making statistical theory available 
has been changed considerably during the past two or 
three years by the appearance of a large number of text- 
books on statistical methods. In the course of preparation 
of the manuscript of the present volume, the writer felt 
at one time that perhaps the recent books had covered 
the ground in such a way as to accomplish the main pur- 
poses of the monograph which was in process of prepara- 
tion. But further consideration gave support to the view 
that although the recent books on statistical method will 
serve useful purposes in the teaching and standardization 
of statistical practice, they have not, in general, gone far 
toward exposing the nature of the underlying theory, 
and some of them may even give misleading impressions 
as to the place and importance of probability theory in 
statistical analysis. 

It thus appears that an exposition of certain essential 
features of the theory involved in statistical analysis 
would conform to the purposes of the Carus Mathemati- 
cal Monographs, particularly if the exposition could be 


made interesting to the general mathematical reader. 
It is not the intention in the above remarks to imply a 
criticism of the books in question. These books serve 
certain useful purposes. In them the emphasis has been 
very properly placed on the use of devices which facili- 
tate the description and analysis of data. 

The present monograph will accomplish its main 
purpose if it makes a slight contribution toward shifting 
the emphasis and point of view in the study of statistics 
in the direction of the consideration of the underlying 
theory involved in certain highly important methods of 
statistical analysis, and if it introduces some of the re- 
cent advances in mathematical statistics to a wider range 
of readers. With this as our main purpose it is natural 
that no great effort is being made to present a well- 
balanced discussion of all the many available topics. 
This will be fairly obvious from omissions which will be 
noted in the following pages. For example, the very im- 
portant elementary methods of description and analysis 
of data by purely graphic methods and by the use of 
various kinds of averages and measures of dispersion are 
for the most part omitted owing to the fact that these 
methods are so available in recent elementary books that 
it seems unnecessary to deal with them in this mono- 
graph. On the other hand, topics which suggest making 
the underlying theories more available are emphasized. 

For the purpose of reaching a relatively large number 
of readers, we are fortunate in that considerable portions 
of the present monograph can be read by those who have 
relatively little knowledge of college mathematics. How- 
ever, the exposition is designed, in general, for readers of 
a certain degree of mathematical maturity, and presup- 


poses an acquaintance with elementary differential and 
integral calculus, and with the elementary principles of 
probability as presented in various books on college alge- 
bra for freshmen. 

A brief list of references is given at the end of Chapter 
VII. This is not a bibliography but simply includes books 
and papers to which attention has been directed in the 
course of the text by the use of superscripts. 

The author desires to express his special indebtedness 
to Professor Burton H. Camp who read critically the 
entire manuscript and made many valuable suggestions 
that resulted in improvements. The author is also in- 
debted to Professor A. R. Crathorne for suggestions on 
Chapter I and to Professor E. W. Chittenden for certain 
suggestions on Chapters II and III. Lastly, the author 
is deeply indebted to Professor Bliss and to Professor 
Curtiss of the Publication Committee for important 
criticisms and suggestions, many of which were made with 
special reference to the purposes of the Carus Mathe- 
matical Monographs. 

Henry L. Rietz 

The University of Iowa 
December, 1926 



I. The Nature of the Problems and Underlying 
Concepts of Mathematical Statistics .... 1 

1. The scope of mathematical statistics 

2. Historical remarks 

3. Two general types of problems 

4. Relative frequency and probability 

5. Observed and theoretical frequency distributions 

6. The arithmetic mean and mathematical expectation 

7. The mode and the most probable value 

8. Moments and the mathematical expectations of 
powers of a variable 

II. Relative Frequencies in Simple Sampling ... 22 

9. The binomial description of frequency 

10. Mathematical expectation and standard deviation 
of the number of successes 

11. Theorem of Bernoulli 

12. The De Moivre-Laplace theorem 

13. The quartile deviation 

14. The law of small probabilities. The Poisson exponential function 

III. Frequency Functions of One Variable .... 46 

15. Introduction 

16. The Pearson system of generalized frequency curves 

17. Generalized normal curves — Gram-Charlier series 

18. Remarks on the genesis of Type A and Type B 

19. The coefficients of the Type A series expressed in 
moments of the observed distribution 

20. Remarks on two methods of determining the co- 
efficients of the Type A series 



21. The coefficients of the Type B series 

22. Remarks 

23. Skewness 

24. Excess 

25. Remarks on the distribution of certain transformed 

26. Remarks on the use of various frequency functions 
as generating functions in a series representation 

IV. Correlation 77 

27. Meaning of simple correlation 

28. The regression method and the correlation surface 
method of describing correlation 

29. The correlation coefficient 


30. Linear regression 

31. The standard deviation of arrays — mean square 
error of estimate 

32. Non-linear regression — the correlation ratio 

33. Multiple correlation 

34. Partial correlation 

35. Non-linear regression in n variables — multiple correlation ratio 

36. Remarks on the place of probability in the regres- 
sion method 


37. Normal correlation surfaces 

38. Certain properties of normally correlated distributions 
39. Remarks on further methods of characterizing 

V. On Random Sampling Fluctuations 114 

40. Introduction 

41. Standard error and correlation of errors in class 



42. Remarks on the assumptions involved in the deriva- 
tion of standard errors 

43. Standard error in the arithmetic mean and in a 
kth moment coefficient about a fixed point 

44. Standard error of the kth moment μk about a mean 

45. Remarks on the standard errors of various statis- 
tical constants 

46. Standard error of the median 

47. Standard deviation of the sum of independent 

48. Remarks on recent progress with sampling errors of 
certain averages obtained from small samples 

49. The recent generalizations of the Bienayme- 
Tchebycheff criterion 

50. Remarks on the sampling fluctuations of an ob- 
served frequency distribution from the underlying 
theoretical distribution 

VI. The Lexis Theory 146 

51. Introduction 

52. Poisson series 

53. Lexis series 

54. The Lexis ratio 

VII. A Development of the Gram-Charlier Series . . 156 

55. Introduction 

56. On a development of Type A and Type B from the 
law of repeated trials 

57. The values of the coefficients of the Type A series 
obtained from the biorthogonal property 

58. The values of the coefficients of the Type A series ob- 
tained from a least-squares criterion 

59. The coefficients of a Type B series 

Notes 173 

Index 178 


CHAPTER I 

THE NATURE OF THE PROBLEMS AND UNDERLYING CONCEPTS OF MATHEMATICAL STATISTICS 

1. The scope of mathematical statistics. The bounds 
of mathematical statistics are not sharply defined. It is 
not uncommon to include under mathematical statistics 
such topics as interpolation theory, approximate integra- 
tion, periodogram analysis, index numbers, actuarial 
theory, and various other topics from the calculus of ob- 
servations. In fact, it seems that mathematical statistics 
in its most extended meaning may be regarded as includ- 
ing all the mathematics applied to the analysis of quanti- 
tative data obtained from observation. On the other 
hand, a number of mathematicians and statisticians have 
implied by their writings a limitation of mathematical 
statistics to the consideration of such questions of fre- 
quency, probability, averages, mathematical expectation, 
and dispersion as are likely to arise in the characterization 
and analysis of masses of quantitative data. Borel has 
expressed this somewhat restricted point of view in his 
statement^ that the general problem of mathematical sta- 
tistics is to determine a system of drawings carried out 
with urns of fixed composition, in such a way that the 
results of a series of drawings lead, with a very high degree 
of probability, to a table of values identical with the table 
of observed values. 

* For footnote references, see pp. 173-77. 


On account of the different views concerning the 
boundaries of the field of mathematical statistics there 
arose early in the preparation of this monograph ques- 
tions of some difficulty in the selection of topics to be in- 
cluded. Although no attempt will be made here to answer 
the question as to the appropriate boundaries of the field 
for all purposes, nevertheless it will be convenient, partly 
because of limitations of space, to adopt a somewhat re- 
stricted view with respect to the topics to be included. To 
be more specific, the exposition of mathematical statistics 
here given will be limited to certain methods and theories 
which, in their inception, center around the names 
of Bernoulli, De Moivre, Laplace, Lexis, Tchebycheff, 
Gram, Pearson, Edgeworth, and Charlier, and which have 
been much developed by other contributors. These meth- 
ods and theories are much concerned with such concepts 
as frequency, probability, averages, mathematical expec- 
tation, dispersion, and correlation. 

2. Historical remarks. While we are currently experi- 
encing a period of special activity in mathematical statis- 
tics which dates back only about forty years, some of the 
concepts of mathematical statistics are by no means of 
recent origin. The word "statistics" is itself a compara- 
tively new word as shown by the fact that its first occur- 
rence in English thus far noted seems to have been in J. F. 
von Bielfeld, The Elements of Universal Erudition, trans- 
lated by W. Hooper, London, 1770. Notwithstanding the 
comparatively recent introduction of the word, certain 
fundamental concepts of mathematical statistics to which 
attention is directed in this monograph date back to the 
first publication relating to Bernoulli's theorem in 1713. 
The line of development started by Bernoulli was carried 


forward by Stirling (1730), De Moivre (1733), Euler 
(1738), and Maclaurin (1742), and culminated in the 
formulation of the probability theory of Laplace. The 
Theorie Analytique des Probabilites of Laplace published 
in 1812 is the most significant publication underlying 
mathematical statistics. For a period of approximately 
fifty years following the publication of this monumental 
work there was relatively little of importance contributed 
to the subject. While we should not overlook Poisson's 
extension of the Bernoulli theory to cases where the prob- 
ability is not constant, Gauss's development of methods 
for the adjustment of observations, Bravais's extension of 
the normal law to functions of two and three variables, 
Quetelet's activities as a popularizer of social statistics, 
nevertheless there was on the whole in this period of fifty 
years little progress. 

The lack of progress in this period may be attributed 
to at least three factors: (1) Laplace left many of his re- 
sults in the form of approximations that would not readi- 
ly form the basis for further development; (2) the follow- 
ers of Gauss retarded progress in the generalization of fre- 
quency theory by overpromoting the idea that deviations 
from the normal law of frequency are due to lack of data; 
(3) Quetelet overpopularized the idea of the stability of 
certain striking forms of social statistics, for example, the 
stability of the number of suicides per year, with the 
natural result that his activities cast upon statistics a 
suspicion of quackery which exists even to some extent 
at present. 

An important step in advance was taken in 1877 in 
the publication of the contributions of Lexis to the classi- 
fication of statistical distributions with respect to normal, 


supernormal, and subnormal dispersion. This theory will 
receive attention in the present monograph. 

The development of generalized frequency curves and 
the contributions to a theory of correlation from 1885 to 
1900 started the period of activity in mathematical statis- 
tics in which we find ourselves at present. The present 
monograph deals largely with the progress in this period, 
and with the earlier underlying theory which facilitated 
relatively recent progress. 

3. Two general types of problems. For purposes of 
description it seems convenient to recognize two general 
classes of problems with which we are concerned in mathe- 
matical statistics. In the problems of the first class our 
concern is largely with the characterization of a set of 
numerical measurements or estimates of some attribute 
or attributes of a given set of individuals. For example, 
we may establish the facts about the heights of 1,000 
men by finding averages, measures of dispersion, and 
various statistical indexes. Our problem may be limited 
to a characterization of the heights of these 1,000 men. 

In the problems of the second class we regard the data 
obtained from observation and measurement as a random 
sample drawn from a well-defined class of items which 
may include either a limited or an unlimited supply. Such 
a well-defined class of items may be called the "popula- 
tion" or universe of discourse. We are in this case con- 
cerned with using the properties of a random sample of 
variates for the purpose of drawing inferences about the 
larger population from which the sample was drawn. For 
example, in this class of problems involving the heights 
of the 1,000 men we would be concerned with the ques- 


tion: What approximate or probable inferences may be 
drawn about the statures of a whole race of men from an 
analysis of the heights of a sample of 1,000 men drawn at 
random from the men of the race? In dealing with such 
questions, we should in the first place consider the diffi- 
culties involved in drawing a sample that is truly random, 
and in the next place the problem of developing certain 
parts of the theory of probability involved in statistical inference. 
The two classes of problems to which we have directed 
attention are not, however, entirely distinct with regard 
to their treatment. For example, the conceptions of prob- 
able and standard error may be used both in describing 
the facts about a sample and in indicating the probable 
degree of precision of inferences which go beyond the ob- 
served sample by dealing with certain properties of the 
population from which we conceive the sample to be 
drawn. Moreover, a satisfactory description of a sample 
is not likely to be so purely descriptive as wholly to pre- 
vent the mind from dwelling on the inner meaning of the 
facts in relation to the population from which the sample 
is drawn. 

As a preliminary to dealing in later chapters with cer- 
tain of the problems falling under these two general classes 
we shall attempt in the present chapter to discuss briefly 
the nature of certain underlying concepts. We shall find 
it convenient to consider these concepts in pairs as fol- 
lows: relative frequency and probability; observed and 
theoretical frequency distributions; arithmetic mean and 
mathematical expectation; mode and most probable val- 
ue; moments and mathematical expectations of a power 
of a variable. 


4. Relative frequency and probability. The frequency 
f of the occurrence of a character or event among s possible 
occurrences is one of the simplest items of statistical 
information. For example, any one of the following items 
illustrates such statistical information: Five deaths in a 
year among 1,000 persons aged 30, nearest birthday; 610 
boys among the last 1,200 children born in a city; 400 
married men out of a total of 1,000 men of age 23; twelve 
cases of 7 heads in throwing 7 coins 1,536 times. 

The determination of the numerical values of the rela- 
tive frequencies f/s corresponding to such items is one of 
the simplest problems of statistics. This simple problem 
suggests a fundamental problem concerning the probable 
or expected values of such relative frequencies if s were 
a very large number. When s is a large number, the rela- 
tive frequency f/s is very commonly accepted in applied 
statistics as an approximate measure of the probability 
of occurrence of the event or character on a given occasion. 

To take an illustration from an important statistical 
problem, let us assume that among l persons equally likely 
to live a year we find d observed deaths during the year. 
That is, we assume that d represents the frequency of 
deaths per year among the l persons each exposed for one 
year to the hazards of death. If l is fairly large, the rela- 
tive frequency d/l is often regarded as an approximation 
to what is to be defined as the probability of death of one 
such person within a year. In fact, it is a fundamental 
assumption of actuarial science that we may regard such 
a relative frequency as an approximation to the proba- 
bility of death when a sufficiently large number of persons 
are exposed to the hazards of death. For a numerical illus- 
tration, suppose there are 600 deaths among 100,000 per- 
sons exposed for a year at age 30. We accept .006 as an 
approximation to the probability in question at age 30. 
In the method of finding such an approximation we decide 
on a population which constitutes an appropriate class 
for investigation and in which individuals satisfy certain 
conditions as to likeness. Then we depend on observation 
to obtain the items which lead to the relative frequency 
which we may regard as an approximation to the probability. 
For an ideal population, let us conceive an urn con- 
taining white and black balls alike except as to color and 
thoroughly mixed. Suppose further for the present that 
we do not know the ratio of the number of white balls to 
the total number in this urn which we may conceive to 
contain either any finite number or an indefinitely large 
number of balls. This ratio is often called the probability 
of drawing a white ball. When the number in the urn is 
finite, we make drawings at random consisting of s balls 
taken one at a time with replacements to keep the ratio 
of the numbers of white and black balls constant. If we 
may assume the number in the urn to be infinite, the 
drawings may under certain conditions be made without 
replacements. Suppose we obtain f white balls as a result 
of thus drawing s balls, then we say that f/s is the relative 
frequency with which we drew white balls. When s is 
large, this relative frequency would ordinarily give us an 
approximate value of the probability of drawing a white 
ball in one trial, that is, an approximate value of the 
ratio of white balls to the total number of balls in the urn. 
Thus far we have not defined probability, but have 


presented illustrations of approximations to probabilities. 
While these illustrations seem to suggest a definition, it 
is nevertheless difficult to frame a definition that is satis- 
factory and includes all forms of probability. The need 
for the concepts of relative frequency and probability in 
statistics arises when we are associating two events such 
that the first may be regarded as a trial and the second 
may be regarded as a success or a failure depending on 
the result of the trial. The relative frequency of success 
is then the ratio of the number of successes to the total 
number of trials. 

If the relative frequency of success approaches a limit 
when the trial is repeated indefinitely under the same set of 
circumstances, this limit is called the probability of success 
in one trial. 
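As a modern illustration of this limit definition, the short simulation below (a Python sketch, not part of the original text; the underlying probability p = 0.5 is an assumed value standing in for the urn of fixed composition) shows the relative frequency f/s settling near p as the number of trials s grows:

```python
import random

def relative_frequency(p, trials, seed=0):
    """Repeat a success/failure trial and return the relative frequency f/s of success."""
    rng = random.Random(seed)
    f = sum(1 for _ in range(trials) if rng.random() < p)
    return f / trials

# As the number of trials s grows, f/s settles near the assumed probability p = 0.5.
for s in (100, 10_000, 1_000_000):
    print(s, relative_frequency(0.5, s))
```

The fixed seed only makes the sketch reproducible; the stabilization of f/s is the empirical fact the definition idealizes.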

There are some objections to this definition of proba- 
bility as well as to any other that we could propose. One 
objection is concerned with questioning the validity of 
the assumption that a limit of the relative frequency 
exists, and another relates to the meaning of the expres- 
sion, "the same set of circumstances." That the limit 
exists is an empirical assumption whose validity cannot 
be proved, but experience with data in many fields has 
given much support to the reasonableness and usefulness 
of the assumption. The objection based on the difficulty 
of controlling conditions so as to repeat the trial under the 
same set of circumstances is an objection that could be 
brought against experimental science in general with re- 
spect to the difficulties of repeating experiments under 
the same circumstances. The experiments are repeated as 
nearly as circumstance permits. 

It seems fairly obvious that the development of sta- 


tistical concepts is approached more naturally from this 
limit definition than from the familiar definitions suggest- 
ed by games of chance. However, we shall at certain 
points in our treatment (for example, see § 11) give 
attention to the fact that various definitions of proba- 
bility exist in which the assumptions differ from those 
involved in the above definition. The meaning of proba- 
bility in statistics is fairly well expressed for some 
purposes by any one of the expressions, theoretical 
relative frequency, presumptive relative frequency, or ex- 
pected value of a relative frequency. Indeed, we some- 
times express the fact that the relative frequency f/s 
is assumed to have the probability p as a limit when 
s → ∞ in abbreviated form by writing E(f/s) = p, where 
E(f/s) is read, "expected value of f/s." It is fairly clear 
that in our definition of probability we simply ideal- 
ize actual experience by assuming the existence of a limit 
of the relative frequency. This idealization, for purposes 
of definition, is in some respects analogous to the ideal- 
ization of the chalk mark into the straight line of geometry. 
In certain cases, notably in games of chance or urn 
schemata, the probability may be obtained without col- 
lecting statistical data on frequencies. Such cases arise 
when we have urn schemata of which we know the ratio 
of the number of white balls to the total number. For 
example, suppose an urn contains 7 white and 3 black 
balls and that we are to inquire into the probability that 
a ball to be drawn will be white. We could experiment by 
drawing one ball at a time with replacements until we 
had made a very large number of drawings and then esti- 
mate the probability from the ratio of the number of 


white balls to the total number of balls drawn. It would 
however in this case ordinarily be much more convenient 
and satisfying to examine the balls to note that they are 
alike except as to color and then make certain assump- 
tions that would give us the probability without actually 
making the trials. 

Thus, when all the possible ways of drawing the balls 
one at a time may be analyzed into 10 equally likely ways, 
and when 7 of these 10 ways give white balls, we assume 
that 7/10 is the probability that the ball to be drawn in 
one trial will be white. This simple case illustrates the 
following process of arriving at a probability: 

If all of an aggregate of ways of obtaining successes and 
failures can be analyzed into s′ possible mutually exclusive 
ways each of which is equally likely, and if f′ of these ways 
give successes, the probability of a success in a single trial 
may be taken to be f′/s′. 

Thus in throwing a single die, what is the probability 
of obtaining an ace? We assume that there are 6 equally 
likely ways in which the die may fall. One of these ways 
gives an ace. Hence, we say 1/6 is the probability of 
throwing an ace. A probability whose value is thus ob- 
tained from an analysis of ways of occurrence into sets of 
equally likely cases and a segregation of the cases in which 
a success would occur is sometimes called an a priori 
probability, while a probability whose approximate value 
is obtained from actual statistical data on repeated trials 
is called an a posteriori or statistical probability. 
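The rule just stated, that the probability may be taken to be the number of favorable ways divided by the number of equally likely ways, can be sketched as a small enumeration (a Python illustration; the function name is ours, not from the text):

```python
from fractions import Fraction

def a_priori_probability(ways, is_success):
    """Probability taken as (ways giving success) / (equally likely ways), i.e. f'/s'."""
    ways = list(ways)
    favorable = sum(1 for w in ways if is_success(w))
    return Fraction(favorable, len(ways))

# Urn containing 7 white and 3 black balls, each equally likely to be drawn:
urn = ["white"] * 7 + ["black"] * 3
print(a_priori_probability(urn, lambda ball: ball == "white"))  # 7/10

# A single die: one of six equally likely faces is an ace:
print(a_priori_probability(range(1, 7), lambda face: face == 1))  # 1/6
```

No trials are made; the probability comes entirely from the analysis into equally likely cases, which is what distinguishes an a priori from a statistical probability.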

In making an analysis to study probabilities, difficult 
questions arise both as to the meaning and fulfilment of 
the condition that the ways are to be "equally likely." 
These questions have been the subject of lively debates 


by mathematicians and philosophers since the time of 
Laplace. It has been fairly obvious that the expression 
"equally likely ways" implies as a necessary condition 
that we have no information leading us to expect the 
event to occur in one of two ways rather than in the other, 
but serious doubt very naturally arises as to the suffi- 
ciency of this condition. In fact, it is fairly clear that lack 
of information is not sufficient. For example, lack of in- 
formation as to whether a spinning coin is symmetrical 
and homogeneous does not assist one in passing on the 
validity of the assumption that it is equally likely to turn 
up head or tail. It is when we have all available relevant 
information on such matters as symmetry and homoge- 
neity that we have a basis for the inference that the two 
ways are equally likely, or not equally likely. Similarly, 
lack of information about two large groups of men of age 
30 would not assist us in making the inference that the 
mortality rates or probabilities of death are approximate- 
ly equal for the two groups. On the other hand, relevant 
information in regard to the results of recent medical 
examinations, occupations, habits, and family histories 
would give support to certain inferences or assumptions 
concerning the equality or inequality of the mortality 
rates for the two groups. 

5. Observed and theoretical frequency distributions. 
In many statistical investigations, it is convenient to par- 
tition the whole group of observations into subgroups or 
classes so as to show the number or frequency of observa- 
tions in each class. Such an exhibit of observations is 
called an "observed frequency distribution." As illustra- 
tions we present the following, where the rows marked F 
are the observed frequency distributions: 


Example 1. A = lengths of ears of corn in inches. 

A....   3   4.5   6.0   7.5   9.0   10.5   12.0 
F....   1   3     20    63    170   67     3 

Example 2. A = prices of commodities for 1919 rela- 
tive to price of 1913 as a base. 

A..  62  87 112 137 162 187 212 237 262 287 312 337 362 387 412 437 462 
F..   1   1   5  16  39  66  61  36  38  24   9   3   3   3   1 

Example 3. A = heights of men in inches. 

A....  61  62  63  64  65  66  67  68  69  70  71  72  73  74 
F....   2  10  11  38  57  93 106 126 109  87  75  23   9   4 

In Example 1 the whole group of ears of corn is ar- 
ranged in classes with respect to length of ears. The class 
interval in this case is taken to be one and one-half inches. 
In Example 2 the class interval is a number, twenty-five; 
in Example 3, it is one inch. 
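The grouping of observations into classes of a fixed interval, as in the examples above, can be sketched in a few lines (a Python illustration; the helper and the sample heights are ours, not data from the text):

```python
from collections import Counter

def observed_distribution(values, interval, origin=0):
    """Group values into classes of the given width and count the frequency in each class."""
    counts = Counter(origin + interval * int((v - origin) // interval) for v in values)
    return sorted(counts.items())

# A small illustrative sample of heights of men in inches, class interval one inch:
heights = [63, 64, 64, 65, 65, 65, 66, 67, 67, 68]
print(observed_distribution(heights, 1))  # [(63, 1), (64, 2), (65, 3), (66, 1), (67, 2), (68, 1)]
```

Each pair is a class mark with its observed frequency, i.e. one row A and one row F of an observed frequency distribution.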

If the variable x takes values x1, x2, . . . . , xn with 
the corresponding probabilities p1, p2, . . . . , pn, we call 
the system of values x1, x2, . . . . , xn and the associated 
probabilities or numbers proportional to them, the theo- 
retical frequency distribution of the variable x. Thus, we 
may write for the theoretical frequency distribution of 
the number of heads in throwing three coins: 

Heads. . . . . . . . . . . . . .    0     1     2     3 

Probabilities. . . . . . . . .   1/8   3/8   3/8   1/8 

Theoretical frequencies. .       1     3     3     1 
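The theoretical distribution above can be reproduced by enumerating the eight equally likely ways in which three coins may fall (a short Python sketch, added here as a modern check):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 8 equally likely outcomes of throwing three coins,
# then count how many outcomes give each number of heads.
outcomes = list(product("HT", repeat=3))
freq = Counter(toss.count("H") for toss in outcomes)
for heads in sorted(freq):
    print(heads, freq[heads], Fraction(freq[heads], len(outcomes)))
# 0 1 1/8
# 1 3 3/8
# 2 3 3/8
# 3 1 1/8
```

The counts 1, 3, 3, 1 are the theoretical frequencies, and dividing by 8 gives the probabilities of the table.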

When for a given set of values of a variable x there exists 
a function F(x) such that the ratio of the number of values 
of x on the interval ab to the number on the interval a′b′ 
is the ratio of the integrals 

∫[a, b] F(x)dx : ∫[a′, b′] F(x)dx , 

for all choices of the intervals ab and a′b′, then F(x) is 
called the frequency function, or the probability density, or 
the law of distribution of the values of x. The curve 
y = F(x) is called a theoretical frequency curve, or more 
briefly the frequency curve. 

To devise methods for the description and characteri- 
zation of the various types of frequency distributions 
which occur in practical problems of statistics is clearly 


[Fig. 1. Showing frequency polygon and free-hand frequency curve of the distribution of heights of men in Example 3.] 

of fundamental importance. Such a description or charac- 
terization may be effected with various degrees of refine- 
ment ranging all the way from one extreme with a simple 
frequency polygon or freehand curve (Fig. 1) representing 
frequencies by ordinates, to a description at the other 
extreme by means of a theoretical frequency curve 
grounded in the theory of probability. 

It is fairly obvious that the latter type of description 
is likely to be much more satisfactory than the former 
because a deeper meaning is surely given to an observed 
distribution if we can effectively describe it by means of 


a theoretical frequency curve than if we can give only a 
freehand or an empirical curve as the approximate repre- 
sentation. However, we should not overlook the fact that 
the description by means of a theoretical curve may be 
too ponderous and laborious for the particular purpose 
of an analysis. Indeed, the use of the theoretical curve is 
likely to be justified in a large way only when it facilitates 
the study of the properties of the class of distributions of 
which the given one is a random sample by enabling us 
to make use of the properties of a mathematical function 
F{x) in establishing certain theoretical norms for the de- 
scription of a class of actual distributions. As important 
supplements to the purely graphic method, we may de- 
scribe the frequency distribution by the use of averages, 
measures of dispersion, skewness, and peakedness. Such 
descriptions facilitate the comparison of one distribution 
with another with respect to certain features. 

6. The arithmetic mean and mathematical expecta-
tion. The arithmetic mean (AM) of n numbers is simply
the sum of the numbers divided by n. That is, the arith-
metic mean of the numbers

x1, x2, . . . . , xn

is given by the formula

(1) AM = (x1 + x2 + . . . . + xn)/n .

The AM is thus what is usually meant by the terms
"mean," "average," or "mean value" when used without
further qualification. If the values x1, x2, . . . . , xn occur
with corresponding frequencies f1, f2, . . . . , fn, respec-
tively, where f1 + f2 + . . . . + fn = s, then it follows
from (1) that the arithmetic mean is given by

(2) AM = (f1x1 + f2x2 + . . . . + fnxn)/s ,

which may also be written

(3) AM = (f1/s)x1 + (f2/s)x2 + . . . . + (fn/s)xn .

The arithmetic mean given by (2) is sometimes called
a "weighted arithmetic mean" where f1, f2, . . . . , fn are
the weights of the values x1, x2, . . . . , xn, respectively,
and (3) may similarly be regarded as a weighted arith-
metic mean, where

f1/s, f2/s, . . . . , fn/s

are the weights of x1, x2, . . . . , xn, respectively.

For our present purpose it is important to note that
the coefficients of x1, x2, . . . . , xn in (3) are the rela-
tive frequencies of occurrence of these values. By defini-
tion of statistical probabilities, the limiting value of ft/s
as s increases indefinitely is pt, where pt is the assumed
probability of the occurrence of a value xt among a set
of mutually exclusive values x1, x2, . . . . , xn. Hence, as
the number of cases considered becomes infinite, the arith-
metic mean would approach a value given by

(4) AM = p1x1 + p2x2 + . . . . + pnxn ,

where the probabilities p1, p2, . . . . , pn may be regard-
ed as the weights of the corresponding values of x.

The mathematical expectation of the experimenter or
the expected value of the variable is a concept that has
been much used by various continental European writers
on mathematical statistics. Suppose we consider the
probabilities p1, p2, . . . . , pn of n mutually exclusive
events E1, E2, . . . . , En, so that p1 + p2 + . . . . + pn = 1.
Suppose that the occurrence of one of these, say Et, on
a given occasion yields a value xt of a variable x. Then
the mathematical expectation or expected value E(x) of
the variable x which takes on values x1, x2, . . . . , xn
with the probabilities p1, p2, . . . . , pn, respectively,
may be defined as

(5) E(x) = p1x1 + p2x2 + . . . . + pnxn .

We thus note by a comparison of (4) and (5) the identity
of the limit of the mean value and the mathematical ex-
pectation.

Furthermore, in dealing with a theoretical distribution
in which pt is the probability that a variable x assumes a
value xt among the possible mutually exclusive values
x1, x2, . . . . , xn, and p1 + p2 + . . . . + pn = 1, we have

(6) AM = p1x1 + p2x2 + . . . . + pnxn .

That is, the mathematical expectation of a variable x and
its mean value from the appropriate theoretical distribu-
tion are identical. While there are probably differences
of opinion as to the relative merits of the language involv-
ing mathematical expectation or expected value in com-
parison with the language which uses the mean value of
a theoretical distribution, or mean value as the number
of cases becomes infinite, the language of expectation
seems the more elegant in many theoretical discussions.
For the discussions in the present monograph we shall
employ both of these types of language.

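The agreement of the weighted mean (2) with the expectation (5) may be checked on the three-coin distribution tabulated in the preceding section; the brief sketch below uses exact fractions.

```python
from fractions import Fraction

# Three-coin distribution: number of heads with
# theoretical frequencies 1, 3, 3, 1.
values = [0, 1, 2, 3]
freqs = [1, 3, 3, 1]
s = sum(freqs)

# Weighted arithmetic mean, formula (2)
am = Fraction(sum(f * x for f, x in zip(freqs, values)), s)

# Mathematical expectation, formula (5), with weights p_t = f_t / s
probs = [Fraction(f, s) for f in freqs]
ex = sum(p * x for p, x in zip(probs, values))

print(am, ex)  # both 3/2
```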
7. The mode and the most probable value. The mode 
or modal value of a variable is that value which occurs 
most frequently (that is, is most fashionable) if such a 
value exists. 

Rough approximations to the mode are used consider- 
ably in general discourse. To illustrate, the meaning of 
the term "average" as frequently used in the newspapers 
in speaking of the average man seems to be a sort of crude 
approximation to the mode. That is, the term "average"
in this connection usually implies a type which occurs 
oftener than any other single type. 

The mode presents one of the most striking character- 
istics of a frequency distribution. For example, consider 
the frequency distribution of ears of corn with respect
to rows of kernels on ears as given in the following table:

A ....  10  12  14  16  18  20  22  24
F ....   1  16 109 241 235 116  41  10

where A = number of rows of kernels and F = frequency.
It may be noted that the frequency increases up to the 
class with 16 rows and then decreases. The mode in rela- 
tion to a frequency distribution is a value to which there 
corresponds a greater frequency than to values just pre- 
ceding or immediately following it in the arrangement. 
That is, the mode is the value of the variable for which 
the frequency is a maximum. A distribution may have 
more than one maximum, but the most common types of 


frequency distributions of both theoretical and practical 
interest in statistics will be found to have only one mode. 

The expression "most probable value" of the number 
of successes in s trials is used in the general theory of 
probability for the number to which corresponds a larger 
probability of occurrence than to any other single number 
which can be named. For example, in throwing 100 coins, 
the most probable number of heads is 50, because 50 is 
more likely than any other single number. 

This does not mean, however, that the probability of 
throwing exactly 50 heads is large. In fact, it is small, but 
nevertheless greater than the probability of throwing 49 
or any other single number of heads. In other words, the 
most probable value is the modal value of the appropriate 
theoretical distribution. 

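The statement about 100 coins can be checked by computing the terms of the corresponding binomial distribution directly; the sketch below finds the modal number of heads and shows that its probability, though maximal, is small.

```python
from math import comb

# The terms of (1/2 + 1/2)^100: probability of exactly m heads in 100 throws
s = 100
probs = [comb(s, m) * 0.5**s for m in range(s + 1)]

# the most probable (modal) number of heads
mode = max(range(s + 1), key=probs.__getitem__)
print(mode, round(probs[mode], 4))   # 50 0.0796
```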
8. Moments and the mathematical expectations of
powers of a variable. With observed frequencies f1, f2,
. . . . , fn corresponding to x1, x2, . . . . , xn, respectively,
and with f1 + f2 + . . . . + fn = s, the kth order moment, per
unit frequency, is defined as

(7) μk' = (1/s) Σ ft xt^k ,    (t = 1, 2, . . . . , n)

which is the arithmetic mean of the kth powers of the
variates. For the sake of brevity, we shall ordinarily use
the word "moment" as an abbreviation for "moment per
unit frequency," when this usage will lead to no misunder-
standing of the meaning.

Consider a theoretical distribution of a variable x tak-
ing values xt (t = 1, 2, . . . . , n). Let the corresponding
probabilities of occurrence pt (t = 1, 2, . . . . , n) be repre-
sented as y-ordinates. Then the moment of order k of the
ordinates about the y-axis is defined as

(8) μk' = Σ pt xt^k .

The mathematical expectation of the kth power of x is
likewise defined as the second member of this equality so
that the kth moment of the theoretical distribution and
the mathematical expectation of the kth power of the
variable x are identical.

When we have a theoretical distribution ranging from
x = a to x = b, and given by a frequency function (p. 13)
y = F(x), we write in place of (8)

μk' = ∫_a^b x^k F(x)dx ,

where F(x)dx gives, to within infinitesimals of higher or-
der, the probability that a value of x taken at random falls
in any assigned interval x to x + dx.

When the axis of moments is parallel to the y-axis
and passes through the arithmetic mean or centroid x̄ of
the variable x, the primes will be dropped from the μ's
which denote the moments. Thus, we write

(9) μk = (1/s) Σ ft(xt − x̄)^k ,

where the arithmetic mean of the values of x is x̄ = μ1'.

The square root of the second moment μ2 about the
arithmetic mean is called the standard deviation and is
very commonly denoted by σ. That is, the standard de-
viation is the root-mean-square of the deviations of a set
of numbers from their arithmetic mean. In the language
of mechanics, σ is the radius of gyration of a set of s equal
particles, with respect to a given centroidal axis.

It is often important to be able to compute the mo-
ments about the axis through the centroid from those
about an arbitrary parallel axis. For this purpose the fol-
lowing relations are easily established by expanding the
binomial in (9) and then making some slight simplifica-
tions:

μ0 = μ0' = 1 ,    μ1 = 0 ,    μ2 = μ2' − μ1'^2 ,

μ3 = μ3' − 3μ1'μ2' + 2μ1'^3 ,

μ4 = μ4' − 4μ1'μ3' + 6μ1'^2 μ2' − 3μ1'^4 ,

and, in general,

μn = Σ C(n, i)(−μ1')^i μ(n−i)' ,    (i = 0, 1, . . . . , n)

where C(n, i) is the number of combinations of n things
taken i at a time.

These relations are very useful in certain problems
of practical statistics because the moments μk' (k = 1, 2,
. . . .) are ordinarily computed first about an axis con-
veniently chosen, and then the moments μk about the
parallel centroidal axis may be found by means of the
above relations. In particular, μ2 = μ2' − μ1'^2 expresses the
very important relation that the second moment μ2 about
the arithmetic mean is equal to the second moment μ2'
about an arbitrary origin diminished by the square μ1'^2
of the arithmetic mean measured from the arbitrary
origin. This is a familiar proposition of elementary me-
chanics when the mean is replaced by the centroid.

When we pass from (9) to corresponding expectations,
the relation μ2 = μ2' − μ1'^2, written in the form μ2' = μ1'^2 + μ2,
tells us that the expected value, E(x^2), of x^2 is equal to
the square, [E(x)]^2, of the expected value of x increased
by the expected value, E[(x − E(x))^2], of the square of the
deviation of x from its expected value.

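The relation μ2 = μ2' − μ1'^2 is easily verified numerically; the small frequency table below is a made-up example chosen only for the check.

```python
# Check mu2 = mu2' - mu1'^2 on a small, hypothetical frequency table.
xs = [2, 4, 7]    # illustrative variate values
fs = [3, 5, 2]    # illustrative frequencies
s = sum(fs)

def moment_about_zero(k):
    # mu_k' of formula (7): the mean of the k-th powers of the variates
    return sum(f * x**k for f, x in zip(fs, xs)) / s

mean = moment_about_zero(1)
# mu_2 of formula (9): second moment about the arithmetic mean
mu2 = sum(f * (x - mean)**2 for f, x in zip(fs, xs)) / s

print(mean, mu2, moment_about_zero(2) - mean**2)
```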

9. The binomial description of frequency. In Chap-
ter I attention was directed to the very simple process of
finding the relative frequency of occurrence of an event
or character among s cases in question. Let us now con-
ceive of repeating the process of finding relative fre-
quencies on many random samples each consisting of s
items drawn from the same population. To characterize
the degree of stability or the degree of dispersion of such
a series of relative frequencies is a fundamental statistical
problem.

To illustrate, suppose we repeat the throwing of a set
of 1,000 coins many times. An observed frequency dis-
tribution could then be exhibited with respect to the
number of heads obtained in each set of 1,000, or with
respect to the relative frequency of heads in sets of 1,000.
Such a procedure would be a laborious experimental treat-
ment of the problem of the distribution of relative fre-
quencies from repeated trials. What we seek is a mathe-
matical method of obtaining the theoretical frequency dis-
tribution with respect to the number of heads or with
respect to the relative frequency of heads in the sets.

To consider a more general problem, suppose we draw
many sets of s balls from an urn one at a time with re-
placements, and let p be the probability of success in
drawing a white ball in one trial. The problem we set is
to determine the theoretical frequency distribution with
respect to the number of white balls per set of s, or with
respect to the relative frequency of white balls in the sets.
To consider this problem, let q be the probability of
failure to draw a white ball in one trial so that p + q = 1.
Then the probabilities of exactly m = 0, 1, 2, . . . . , s
successes in s trials are given by the successive terms of
the binomial expansion

(1) (q + p)^s = q^s + spq^(s−1) + C(s, 2)p^2 q^(s−2) + . . . .
        + C(s, m)p^m q^(s−m) + . . . . + p^s ,

where C(s, m) = s!/(m!(s − m)!) is the number of combina-
tions of s things taken m at a time.

Derivations of this formula for the probability of m 
successes in s trials from certain definitions of probability 
are given in books on college algebra for freshmen. For a 
derivation starting from the definition of probability as a 
limit, the reader is referred to Coolidge.^ A frequency 
distribution with class frequencies proportional to the 
terms of (1) is sometimes called a Bernoulli distribution. 
Such a theoretical distribution shows not only the most 
probable distribution of the drawings from an urn, as 
described above, but it serves also as a norm for the dis- 
tribution of relative frequencies obtained from some of 
the simplest sampling operations in applied statistics. For 
example, the geneticist may regard the Bernoulli dis- 
tribution (1) as the theoretical distribution of the rela- 
tive frequencies m/s of green peas which he would obtain 

* See references on pp. 173-77. 


among random samples each consisting of a yield of s
peas. The biologist may regard (1) as the theoretical dis-
tribution of the relative frequencies of male births in
random samples of s births. The actuary may regard (1)
as the theoretical distribution of yearly death-rates in
samples of s men of equal ages, say of age 30, drawn from
a carefully described class of men. In this case we specify
that the samples shall be taken from a carefully described
class of men because the assumptions involved in the urn
schemata underlying a Bernoulli distribution do not per-
mit a careless selection of data. Thus, it would not be in
accord with the assumptions to take some of the samples
from a group of teachers with a relatively low rate of
mortality and others from a group of anthracite coal
miners with a relatively high rate of mortality.

The fact stated at the beginning of this section that
we are concerned with repeating the process of drawing
from the same population is intended to imply that the
same set of circumstances essential to drawing a random
sample shall exist throughout the whole series of draw-
ings.

The expression "simple sampling" is sometimes ap-
plied to drawing a random sample when the conditions
for repetition just described are fulfilled. In other words,
simple sampling implies that we may assume the under-
lying probability p of formula (1) remains constant from
sample to sample, and that the drawings are mutually
independent in the sense that the results of drawings do
not depend in any significant manner on what has hap-
pened in previous drawings.

In Figure 2 the ordinates at x = 0, 1, 2, . . . . , 7 show
the values of the terms of (1) for p = q = 1/2, s = 7. To find
the "most probable" or modal number of successes
m' in s trials, we seek the value of m = m' which gives a
maximum term of (1). To find this value of m, we write
the ratios of the general term of (1) to the preceding and
the succeeding terms. The first ratio will be equal to or
greater than unity when

(s − m + 1)p / mq ≥ 1 ,    or    m ≤ ps + p .

In the same way, the second ratio will be equal to or
greater than unity when

(m + 1)q / (s − m)p ≥ 1 ,    or    m ≥ ps − q .

We have, thus, the integer m = m' which gives the modal
value determined by the inequalities

ps − q ≤ m' ≤ ps + p .

We may say therefore that, neglecting a proper fraction,
ps is the most probable or modal number of successes.
When ps − q and ps + p are integers, there occur two equal
terms in (1) each of which is larger than any other term
of the series. For example, note the equality of the first
and second terms of the expansion (5/6 + 1/6)^5.

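The inequalities ps − q ≤ m' ≤ ps + p may be tested by computing the terms of (1) exactly; the sketch below finds the modal number of successes for the two cases of the text, including the two equal maximal terms of (5/6 + 1/6)^5.

```python
from fractions import Fraction
from math import comb

def modal_successes(s, p):
    # exact binomial terms C(s, m) p^m q^(s-m), m = 0, ..., s
    q = 1 - p
    terms = [comb(s, m) * p**m * q**(s - m) for m in range(s + 1)]
    best = max(terms)
    return [m for m, t in enumerate(terms) if t == best]

# s = 7, p = 1/2: ps - q = 3 and ps + p = 4 are integers, two modal values
print(modal_successes(7, Fraction(1, 2)))   # [3, 4]

# (5/6 + 1/6)^5: ps - q = 0 and ps + p = 1, first two terms equal
print(modal_successes(5, Fraction(1, 6)))   # [0, 1]
```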

10. Mathematical expectation and standard deviation
of the number of successes. Let m̄ be the mathematical
expectation of the number of successes, or what is the
same thing, the arithmetic mean number of successes in
s trials under the law of repeated trials as defined by
formula (1) on page 23. We shall now prove that m̄ = ps.

By definition (§6),

(2) m̄ = Σ m [s!/(m!(s − m)!)] p^m q^(s−m) ,    (m = 0, 1, . . . . , s)

(3)    = sp Σ [(s − 1)!/((m − 1)!(s − m)!)] p^(m−1) q^(s−m) = sp ,

the second sum extending over m = 1, 2, . . . . , s and being
the expansion of (q + p)^(s−1) = 1.

Let d = m − sp be the discrepancy of the number of
successes from the mathematical expectation, and let σ^2
be the mathematical expectation of the square of the dis-
crepancy. By definition,

(4) σ^2 = Σ [s!/(m!(s − m)!)] p^m q^(s−m) (m − sp)^2

        = Σ [s!/(m!(s − m)!)] p^m q^(s−m) (m^2 − 2msp + s^2p^2) ,

the sums extending over m = 0, 1, . . . . , s.

We shall now prove that σ^2 = spq. To do this, we
write m^2 = m + m(m − 1) and obtain for the first term of
(4) the value

(5) Σ m^2 [s!/(m!(s − m)!)] p^m q^(s−m) = sp + s(s − 1)p^2 .

From (2), (3), (4), and (5), we have

(6) σ^2 = sp + s(s − 1)p^2 − 2s^2p^2 + s^2p^2
        = sp(1 − p) = spq .

The measure of dispersion σ is often called the
standard deviation of the frequency of successes in the
s trials.
Next, we define d/s = (m/s) − p as the relative dis-
crepancy, for it is the difference between the relative
frequency of success and the probability of success. The
mean square of the relative discrepancy is the second
member of equation (4) divided by s^2. It is clearly equal
to the mean square σ^2 of the discrepancy divided by s^2,
which is

(7) σ^2/s^2 = spq/s^2 = pq/s .

The theoretical value of the standard deviation of the
relative frequency of successes is then (pq/s)^(1/2).

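Both results, m̄ = sp and σ^2 = spq, can be confirmed by direct summation over the terms of (1); the values s = 20, p = .3 below are illustrative only.

```python
from math import comb, sqrt

# Verify m-bar = sp and sigma^2 = spq directly from the terms of (1),
# for illustrative values s = 20, p = 0.3 (so sp = 6, spq = 4.2).
s, p = 20, 0.3
q = 1 - p
pmf = [comb(s, m) * p**m * q**(s - m) for m in range(s + 1)]

mean = sum(m * pm for m, pm in enumerate(pmf))            # expectation of m
var = sum((m - mean)**2 * pm for m, pm in enumerate(pmf)) # expected d^2
print(mean, var, sqrt(var))
```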
11. Theorem of Bernoulli. The theorem of Bernoulli
deals with the fundamental problem of the approach of
the relative frequency m/s of success in s trials to the
underlying constant probability p as s increases. The
theorem may be stated as follows:

In a set of s trials in which the chance of a success in
each trial is a constant p, the probability P of the relative
discrepancy (m/s) − p being numerically as large as any as-
signed positive number ε will approach zero as a limit as the
number of trials s increases indefinitely, and the probability,
Q = 1 − P, of this relative discrepancy being less than ε ap-
proaches 1 or certainty.

This theorem is sometimes called the law of large
numbers. The theorem has been very commonly regarded
as the basic theorem of mathematical statistics. But with
the definition of probability (p. 8) as the limit of the rela-
tive frequency, this theorem is an immediate consequence
of the definition. While it adds to the definition some-
thing about the manner of approach to the limit, the
theorem is in some respects not so strong as the corre-
sponding assumption in the definition.

With a definition of probability other than the limit
definition, the theorem may not follow so readily. It has
been regarded as fundamental because of its bearing on
the use of the relative frequency m/s (s large) as if it
were a close approximation to the probability p. Assum-
ing for the present that we have any definition of the
probability p of success in one trial from which we reach
the law of repeated trials given in the binomial expansion
(1), we may prove the Bernoulli theorem by the use of
the Bienaymé-Tchebycheff criterion.

To derive this criterion, consider a statistical variable
x which takes mutually exclusive values x1, x2, . . . . , xn
with probabilities p1, p2, . . . . , pn, respectively, where

p1 + p2 + . . . . + pn = 1 .


Let a be any given number from which we wish to
measure deviations of the x's. A specially important case
is that in which a is a mean or expected value of x, al-
though a need not be thus restricted. For the expected
mean-square deviation from a, we may write

σ^2 = p1d1^2 + p2d2^2 + . . . . + pndn^2 ,

where dt = xt − a.

Let d', d'', . . . . , be those deviations xt − a which are
at least as large numerically as an assigned multiple
λσ (λ > 1) of the root-mean-square deviation σ from
a, and let p', p'', . . . . , be the corresponding probabili-
ties. Then we have

(8) σ^2 ≥ p'd'^2 + p''d''^2 + . . . . .

Since d', d'', . . . . , are each numerically equal to or
greater than λσ, we have from (8) that

σ^2 ≥ λ^2σ^2(p' + p'' + . . . .) .

If we now let P(λσ) be the probability that a value of
x taken at random from the "population" will differ from
a numerically by as much as λσ, then P(λσ) = p' + p''
+ . . . . , and σ^2 ≥ λ^2σ^2 P(λσ). Hence

(9) P(λσ) ≤ 1/λ^2 .

To illustrate numerically we may take a to be the arith-
metic mean of the x's and say that the probability is not
more than 1/25 that a variate taken at random will
deviate from the arithmetic mean as much as five times
the standard deviation.

A striking property of the Bienaymé-Tchebycheff
criterion is its independence of the nature of the distribu-
tion of the given values.

In a slightly different form, we may state that the
probability is greater than 1 − 1/λ^2 that a variate taken
at random will deviate less than λσ from the mathe-
matical expectation. This theorem is ordinarily known
as the inequality of Tchebycheff,^ but the main ideas
underlying the inequality were also developed by Bie-
naymé.

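The criterion may be checked on any distribution whatever; the values and probabilities in the sketch below are arbitrary, chosen only to show that the tail probability never exceeds 1/λ^2.

```python
# Bienayme-Tchebycheff check on an arbitrary discrete distribution;
# the values and probabilities here are made up for the illustration.
xs = [0, 1, 2, 10]
ps = [0.50, 0.30, 0.15, 0.05]

a = sum(p * x for p, x in zip(ps, xs))                      # take a = mean
sigma = sum(p * (x - a)**2 for p, x in zip(ps, xs)) ** 0.5  # rms deviation

lam = 2.0
# probability that |x - a| >= lambda*sigma: must not exceed 1/lambda^2
tail = sum(p for p, x in zip(ps, xs) if abs(x - a) >= lam * sigma)
print(tail, 1 / lam**2)
```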
We shall now turn our attention more directly to the
theorem of Bernoulli. We seek the probability that the
relative discrepancy (m/s) − p will be numerically as large
as an assigned positive number ε.

We may take ε = λ(pq/s)^(1/2), a multiple of the theoreti-
cal standard deviation (pq/s)^(1/2) of the relative frequencies
m/s. (See §10.)

Let P be the probability that | (m/s) − p | ≥ ε;
then from the Bienaymé-Tchebycheff criterion (9), we
have P ≤ 1/λ^2. Since

1/λ = (1/ε)(pq/s)^(1/2) ,    we have    P ≤ pq/(sε^2) .



For any assigned ε, we may by increasing s make P
small at pleasure. That is, the probability P that the
relative frequency m/s will differ from the probability p
by at least as much as an assigned number, however small,
tends toward zero as the number of cases s is indefinitely
increased.

For example, if we are concerned with the probability
P that | (m/s) − p | ≥ .001, we see that P ≤ 1,000,000pq/s.
If the number of trials s is not very large, this inequality
would ordinarily put no important restriction on P. But
as s increases indefinitely, 1,000,000pq remains constant,
and 1,000,000pq/s approaches zero. Again, the probabil-
ity Q = 1 − P that | (m/s) − p | is less than ε satisfies the
inequality

(10) Q ≥ 1 − pq/(sε^2) .

From (10) we see that with any constant pq/ε^2, the
probability Q becomes arbitrarily near 1 or certainty as
s increases indefinitely. Hence the theorem is established
for any definition of probability from which we derive
(1) as the law of repeated trials.

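The text's numerical illustration, the bound 1,000,000pq/s for ε = .001, can be tabulated for increasing s:

```python
# The bound P <= pq/(s*eps^2) from the Bienayme-Tchebycheff criterion,
# evaluated for eps = .001 and p = q = 1/2 (the text's 1,000,000*pq/s).
p = q = 0.5
eps = 0.001

for s in (10**6, 10**8, 10**10):
    bound = p * q / (s * eps**2)
    print(f"s = {s:>12,}: P <= {bound:.6f}")
```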
It seems that the statement of the theorem concern-
ing the probable approach of relative frequencies to the
underlying probability may appear simpler and more ele-
gant by the use of the concept of asymptotic certainty
introduced by E. L. Dodd in a recent paper.^ According
to this concept, we may say it is asymptotically certain
that m/s will approach p as a limit as s increases in-
definitely.
12. The De Moivre-Laplace Theorem.^ The De
Moivre-Laplace theorem deals with the probability that
the number of successes m in a set of s trials will fall
within a certain conveniently assigned discrepancy d from
the mathematical expectation sp. By the inequality of
Tchebycheff (p. 30) a lower limit to the value of this
probability has been given. We shall now proceed to con-
sider the problem of finding at least the approximate
value of the probability. This problem would, in the
simplest cases, involve merely the evaluation and addi-
tion of certain terms of the expansion (1). But this pro-
cedure would, in general, be impracticable when s is large
and d even fairly large. To visualize the problem we
represent the terms of (1) by ordinates y_x at unit inter-
vals, where x marks deviations of m from the mathe-
matical expectation of successes ps as an origin. Then we
have

(11) y_x = [s!/((sp + x)!(sq − x)!)] p^(sp+x) q^(sq−x) .

The probability that the number of successes will lie
within the interval ps − d and ps + d, inclusive of end
values, is then the sum of the ordinates

(12) y_−d + y_−(d−1) + . . . . + y_0 + y_1 + . . . . + y_d
        = Σ y_x ,    (x = −d, . . . . , d)

As the number of y's in this sum is likely to be large,
some convenient method of finding the approximate
value of the sum will be found useful. In attacking this
problem, we shall first of all replace the factorials in (11)
approximately by the first term of Stirling's formula for
the representation of large factorials.

This formula^ states that

(13) n! = n^n e^(−n) (2πn)^(1/2) (1 + 1/(12n) + 1/(288n^2) + . . . .) .

To form an idea of the degree of approximation obtained
by using only the first term of this formula, we may say
that in replacing n! by n^n e^(−n)(2πn)^(1/2) we obtain a result
equal to the true value divided by a number between 1
and 1 + 1/(10n). The use of this first term is thus a suffi-
ciently close approximation for many purposes if n is fair-
ly large. The substitution by the use of Stirling's formula
for factorials in (11) gives, after some algebraic simplifica-
tions,

(14) y_x = (2πspq)^(−1/2) (1 + x/(sp))^(−(sp+x+1/2)) (1 − x/(sq))^(−(sq−x+1/2)) .

To explain further our conditions of approximation to
(11), we naturally compare any individual discrepancy
x from the mathematical expectation ps with the standard
deviation σ = (spq)^(1/2). We should note in this connection
that σ is of order s^(1/2) if neither p nor q is extremely small.
This fact suggests the propriety of assuming that s is so
large that x/s shall remain negligibly small, but that
x/s^(1/2) may take finite values such as interest us most
when we are making comparisons of a discrepancy with
the standard deviation. It is important to bear in mind
that we are for the present dealing with a particular kind
of approximation.

Under the prescribed conditions of approximation, we
shall now examine (14) with a view to obtaining a more
convenient form for y_x. For this purpose, we may write

(15) log (1 + x/(sp)) = x/(sp) − x^2/(2s^2p^2) + (x^3/s^3)φ(x) ,

(16) log (1 − x/(sq)) = −x/(sq) − x^2/(2s^2q^2) + (x^3/s^3)φ1(x) ,

where φ(x) and φ1(x) are finite because each of them
represents the sum of a convergent power series when x/s
is small at pleasure. From (14), (15), and (16),

y_x = (2πspq)^(−1/2) e^(−x^2/(2spq) + (x/s)φ3(x)) ,

where φ3(x) is clearly finite.

Now if s is so large that (x/s)φ3(x) becomes small, we
may write

y_x = (2πspq)^(−1/2) e^(−x^2/(2spq))

as an approximation to y_x in (11).

As a first approximation to the sum of the ordinates
in (12), we then write the integral

(17) (2πspq)^(−1/2) ∫_−d^+d e^(−x^2/(2spq)) dx .

This integral is commonly known as the probability
integral. The ordinates of the bell-shaped curve (Fig. 3)
represent the values of the function

y = (2πσ^2)^(−1/2) e^(−x^2/(2σ^2)) ,

where σ^2 = spq. This curve is the normal frequency curve
and will be further considered in Chapter III.

We may increase slightly the accuracy of our ap-
proximation by taking account of the fact that we have
one more ordinate in (12) than intervals of area. We may
therefore appropriately add an ordinate at x = d to the
value given in (17), and obtain

(18) (2πspq)^(−1/2) ∫_−d^+d e^(−x^2/(2spq)) dx + (2πspq)^(−1/2) e^(−d^2/(2spq))

for the probability that the discrepancy is between
−d and d inclusive of end points.

Another method of taking account of the extra
ordinate is to extend the limits of integration in (17) by
one-half the unit at both the upper and lower limits.
That is, we write

(19) (2πspq)^(−1/2) ∫_(−d−1/2)^(d+1/2) e^(−x^2/(2spq)) dx

in place of (17).

We may now state the De Moivre-Laplace theorem:
Given a constant probability p of success in each of s
trials where s is a large number, the probability that the
discrepancy m − sp of the number m of successes from the
mathematical expectation will not exceed numerically a
given positive number d is given to a first approximation by
(17) and to closer approximations by (18) and (19).


Although formulas (17), (18), and (19) assume s
large, it is interesting to experiment by applying these
formulas to cases in which s is not large. For example,
consider the problem of tossing six coins. The most prob-
able number of heads is 3, and the probability of a dis-
crepancy equal to or less than 1 is given exactly by

(6!/(3!3!) + 6!/(4!2!) + 6!/(2!4!)) · 1/64 = 25/32 ,

which is the sum of the probabilities that the number of
heads will be 2, 3, or 4 for s = 6 coins. But spq = 1.5, and
(spq)^(1/2) = 1.225. Then using −3/2 to 3/2 as limits of
integration in (19), we obtain from a table of the probabil-
ity integral the approximate value .779 to compare with
the exact value 25/32 = .781.

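The comparison of the exact value 25/32 with the approximation from (19) can be reproduced as follows; the error function erf supplies the probability integral.

```python
from math import comb, erf, sqrt

# Exact probability of 2, 3, or 4 heads with 6 coins
exact = sum(comb(6, m) for m in (2, 3, 4)) / 2**6          # 25/32

# Approximation (19): normal integral from -3/2 to +3/2, sigma^2 = spq
spq = 6 * 0.5 * 0.5                                        # 1.5
approx = erf(1.5 / sqrt(2 * spq))   # P(|x| <= 3/2) for the normal curve
print(round(exact, 3), round(approx, 3))   # 0.781 0.779
```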
For certain purposes, there is an advantage in chang-
ing the variable x to t in (17) and (18) by the transforma-
tion

t = x/(spq)^(1/2) ,    dt = dx/(spq)^(1/2) ,    δ = d/(spq)^(1/2) .

Then in place of (17) we have

(20) P_δ = (2π)^(−1/2) ∫_−δ^+δ e^(−t^2/2) dt ,

and in place of (18) we have

(21) P_δ = (2π)^(−1/2) ∫_−δ^+δ e^(−t^2/2) dt + (2πspq)^(−1/2) e^(−δ^2/2) .

To give a general notion of the magnitude of the
probabilities, we shall now list a few values of P_δ in (20)
corresponding to assigned values of δ. Thus,

δ ....  .6745    1        2        3        4
P_δ ..  .5       .68269   .95450   .99730   .99994

Extensive tables giving values of the probability
integral and of the ordinates of the probability curve are
readily available. For example, the Glover Tables of Ap-
plied Mathematics^ give P_δ/2 for the argument δ = x/σ.
Sheppard's table^ gives (1 + P_δ)/2 for the argument
δ = x/σ.

We may now state the De Moivre-Laplace theorem
in another form by saying that the values of P_δ in (20)
and (21) give approximations to the probability that
| m − sp | < δ(spq)^(1/2) for an assigned positive value of δ.

In still another slightly different form involving rela-
tive frequencies, we may state that the values of P_δ in
(20) and (21) give approximations to the probability that
the absolute value of the relative discrepancy satisfies the
inequality

(22) | (m/s) − p | < δ(pq/s)^(1/2) ,

for every assigned positive value of δ.

In order to gain a fuller insight into the significance of
the De Moivre-Laplace theorem we may draw the follow-
ing conclusions from (20): (a) Assuming as is suggested
by (20) that a δ exists corresponding to every assigned
probability P_δ, we find from d = δ(spq)^(1/2) that the bounds
−d to +d increase in proportion to s^(1/2) as s is increased.
(b) From (20) and (22) it follows that for assigned prob-
abilities P_δ the bounds of discrepancy of the relative
frequency m/s from p vary inversely as s^(1/2).

To illustrate the use of the De Moivre-Laplace
theorem, we take an example from the third edition of
the American Men of Science by Cattell and Brimhall
(p. 804). A group of scientific men reported 1,705 sons
and 1,527 daughters. The examination of these numbers
brings up the following fundamental questions of simple
sampling. Do these data conform to the hypothesis that
1/2 is the probability that a child to be born will be a
boy? That is, can the deviations be reasonably regarded
as fluctuations in simple sampling under this hypoth-
esis? In another form, what is the probability in throw-
ing 3,232 coins that the number of heads will differ
from (3,232/2) = 1,616 by as much as, or more than,
1,705 − 1,616 = 89?

In this case,

s = 3,232 ,    (pqs)^(1/2) = 28.425 ,

d = 1,705 − 1,616 = 89 ,    δ = d/(pqs)^(1/2) = 3.131 .

Referring to a table of the normal probability in-
tegral, we find from (20) that P = .9983. Hence, the prob-
ability that we will obtain a deviation of more than 89 on
either side of 1,616 in a single trial is approximately
1 − .9983 = .0017.

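The figures of this example are easily recomputed; erf again supplies the normal probability integral of (20).

```python
from math import erf, sqrt

# Cattell-Brimhall example: s = 3,232 children, hypothesis p = 1/2
s, p = 3232, 0.5
sd = sqrt(s * p * (1 - p))      # (pqs)^(1/2) = 28.425...
d = 1705 - 1616                 # observed discrepancy, 89
delta = d / sd                  # about 3.131

P = erf(delta / sqrt(2))        # P_delta of (20)
print(round(P, 4), round(1 - P, 4))   # approximately .9983 and .0017
```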
13. The quartile deviation. The discrepancy d which
corresponds to the probability P = 1/2 in (20) is some-
times called the quartile deviation, or the probable error of
m as an approximation to sp.

By the use of a table of the probability integral, it is
found from (20) that d = .6745(spq)^(1/2) approximately
when P = 1/2, and thus .6745(spq)^(1/2) is the quartile
deviation of the number of successes from the expecta-
tion sp.

14. The law of small probabilities. The Poisson ex-
ponential function. The De Moivre-Laplace theorem does
not ordinarily give a good approximation to the terms of
the binomial (p + q)^s if p or q is small. If s is large but
sp or sq is small in relation to s, we may give a useful
representation of terms of the binomial expansion (p + q)^s
by means of the Poisson exponential function. Statistical
examples of this situation are what may be called rare
events and may easily be given: the number born blind
per year in a city of 100,000, or the number dying per
year of a minor disease.

Poisson^ had already as early as 1837 given the func-
tion involved in the treatment of the problem. Bort-
kiewicz^ took up the problem in connection with a long
series of observations of events which occur rarely. For
example, one well-known series he gave was the frequency
distribution of the number of men killed per army corps
per year in the Prussian army from the kicks of horses.
The frequency distribution of the number of deaths per
army corps per year was:

[Table: frequency distribution of deaths per army corps per year]

He called the law of frequency involved the "law of small numbers," and this name continues to be used although it does not seem very appropriate. The expression "law of small probabilities" seems to give a more accurate description. Assume, then, that the probability p is small and that q = 1 - p is nearly unity. That is, p is the probability of the occurrence of the rare event in question in a single trial.

We then seek a convenient expression approximately equal to

[s!/(m! n!)] p^m q^n ,

the probability of m occurrences and n non-occurrences in m + n = s trials.

Replacing s! and n! by means of Stirling's formula, we obtain approximately

(sp)^m q^n e^{-m} / [(1 - m/s)^{n+1/2} m!] .

With large values of s and relatively small values of m, (1 - m/s)^{n+1/2} differs relatively little from (1 - m/s)^s, and this in turn differs relatively little from e^{-m}. Furthermore, q^n = (1 - p)^n differs very little from e^{-np} since, on the one hand,

(1 - p)^n < e^{-np} , because 1 - p < e^{-p} ,

and, on the other,

(1 - p)^n > e^{-np/(1-p)} , because 1 - p > e^{-p/(1-p)} ,

and these two bounds differ little when p is small. Introducing these approximations by substituting e^{-m} for (1 - m/s)^{n+1/2}, and e^{-np} for q^n, we have

(sp)^m e^{-np} / m! .
For rare events of small probability p, np differs very little from sp = λ. Hence, we write

(23)  λ^m e^{-λ} / m!

for the approximate probability of m occurrences of the rare event. Then the terms of the series

e^{-λ} (1 + λ + λ²/2! + λ³/3! + ....)

give the approximate probabilities of exactly 0, 1, 2, . . . . , occurrences of the rare event in question, and the sum of the series

(24)  e^{-λ} (1 + λ + λ²/2! + .... + λ^m/m!)

gives the probability that the rare event will happen either 0, 1, 2, . . . . , or m times in s trials.
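As a numerical illustration of (23), the binomial terms may be compared directly with their Poisson approximations. The sketch below (in Python; the illustrative choice s = 1,000 and p = .003, so that λ = 3, is not from the text) tabulates both:

```python
import math

def binomial_term(s, p, m):
    """Exact probability of m occurrences in s trials: C(s, m) p^m q^(s - m)."""
    return math.comb(s, m) * p**m * (1 - p)**(s - m)

def poisson_term(lam, m):
    """Poisson approximation (23): lambda^m e^(-lambda) / m!."""
    return lam**m * math.exp(-lam) / math.factorial(m)

s, p = 1000, 0.003          # a hypothetical rare event with sp = 3
lam = s * p
for m in range(6):
    print(m, round(binomial_term(s, p, m), 5), round(poisson_term(lam, m), 5))
```

The two columns agree to about three decimal places, which is the kind of accuracy the approximation is intended to supply.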

Although we have assumed in deriving the Poisson exponential function λ^m e^{-λ}/m! that m is small in comparison with s, we may obtain certain simple and interesting results for the mathematical expectation and standard deviation of the distribution given by the Poisson exponential when m takes all integral values from m = 0 to m = s. Thus, when m = s in (24), we clearly have

(25)  e^{-λ} (1 + λ + λ²/2! + .... + λ^s/s!) = 1

approximately if s is large.

Since the successive terms in (25) give approximately the probabilities of 0, 1, 2, . . . . , s occurrences of the rare event, the mathematical expectation Σ m P_m of the number of such occurrences is

e^{-λ} λ (1 + λ + λ²/2! + .... + λ^{s-1}/(s-1)!) = λ

approximately when s is large.

Similarly, the second moment μ₂′ about the origin is

μ₂′ = e^{-λ} [λ + 2λ² + 3λ³/2! + .... + sλ^s/(s-1)!] ,

and the second moment about the mathematical expectation is

(26)  μ₂ = e^{-λ} [λ + 2λ² + 3λ³/2! + .... + sλ^s/(s-1)!] - λ²
         = λ + λ² - λ² = λ = sp , nearly,

an approximation to spq since q differs but little from 1.
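The approximations (25) and (26) are easily verified numerically. In the short Python check below (λ = 4 and s = 60 are illustrative choices, not from the text), the truncated series has total probability, mean, and second moment about the mean all agreeing with the stated values:

```python
import math

def poisson_pm(lam, m):
    """Term of the Poisson exponential: lambda^m e^(-lambda) / m!."""
    return lam**m * math.exp(-lam) / math.factorial(m)

lam, s = 4.0, 60            # s large in comparison with lambda
P = [poisson_pm(lam, m) for m in range(s + 1)]
total = sum(P)                                    # (25): nearly 1
mean = sum(m * P[m] for m in range(s + 1))        # nearly lambda
mu2p = sum(m * m * P[m] for m in range(s + 1))    # second moment about origin
var = mu2p - mean**2                              # (26): nearly lambda
print(round(total, 6), round(mean, 6), round(var, 6))   # -> 1.0 4.0 4.0
```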
Tables of the Poisson exponential limit e^{-λ}λ^x/x! are given in Tables for Statisticians and Biometricians (pp. 113-24), and in Biometrika, Volume 10 (1914), pages 25-35. The values of e^{-λ}λ^x/x! are tabulated to six places of decimals for λ varying from .1 to 15 by intervals of one-tenth and for x varying from 0 to 37.



A general notion of the values of the function for certain values of λ may be obtained from Figure 4, where the ordinates at 0, 1, 2, . . . . , show the values of the function for λ = .5, 1, 2, and 5.

Miss Whittaker has prepared special tables (Tables for Statisticians and Biometricians, pp. 122-24) which facilitate the comparison of results from the Poisson exponential with those from the De Moivre-Laplace theory in dealing with the sampling fluctuations of small frequencies. The question naturally arises as to the value of p below which we should prefer to use the Poisson exponential, in place of the results of the De Moivre-Laplace theory, in dealing with the probability of a discrepancy less than an assigned number. While there is no exact answer to this question, there seems to be good reason for certain purposes in restricting the application of the De Moivre-Laplace results to cases where the probability is perhaps not less than p = .03.

[Fig. 4]

To illustrate by a concrete situation in which p is small, consider a case of 6 observed deaths from pneumonia in an exposure of 10,000 lives of a well-defined class aged 30 to 31. It is fairly obvious, on the one hand, that the possible variations below 6 are restricted to 6, whereas there is no corresponding restriction above 6. On the other hand, if we take 6/10,000 = 3/5,000 as the probability of death from pneumonia within a year of a person aged 30, it is more likely that we shall experience 5 deaths than 7 deaths among the 10,000 exposed; for the probability

C(10,000, 5) (4,997/5,000)^{9,995} (3/5,000)^5

of 5 deaths is greater than the probability

C(10,000, 7) (4,997/5,000)^{9,993} (3/5,000)^7

of 7 deaths.

Suppose we now set the problem of finding the probability that upon repetition with another sample of 10,000, the deviation from 6 deaths on either side will not exceed 3. The value to three significant figures calculated from the binomial expansion is .854. To use the De Moivre-Laplace theorem, we simply make d = 3 in (19), and obtain from tables of probability functions the value P₃ = .847.

We should thus expect from the De Moivre-Laplace theorem a discrepancy either in defect more than 3 or in excess more than 3 in 100 - 84.7 = 15.3 per cent of the cases, and from the sum of the binomial terms we should expect such a discrepancy in 100 - 85.4 = 14.6 per cent of the cases.

Turning next to tables of the Poisson exponential, page 122 of Tables for Statisticians and Biometricians, we find that in 6.197 per cent of cases there will be a discrepancy in defect more than 3, and in 8.392 per cent of cases a discrepancy in excess more than 3.


The sum of 6.197 and 8.392 per cent is 14.589 per cent. For purposes of dealing with sampling errors, this result differs very little from the 15.3 per cent given by the De Moivre-Laplace formula; but it is a closer approximation to the correct value, and it has the advantage of showing separately the percentage of cases in excess more than the assigned amount and the percentage in defect more than the same amount.
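The three approximations compared in this article can be reproduced in a few lines of Python (not part of the original text; the half-unit correction d + 1/2 is assumed to be the form of (19), since it reproduces the tabulated value P₃ = .847):

```python
import math

s, p, d = 10_000, 3 / 5_000, 3          # 6 expected deaths
lam = s * p                              # = 6

# Exact binomial probability that the deviation from 6 does not exceed 3
binom = sum(math.comb(s, k) * p**k * (1 - p)**(s - k) for k in range(3, 10))

# De Moivre-Laplace approximation with the half-unit correction
sigma = math.sqrt(s * p * (1 - p))
normal = math.erf((d + 0.5) / (sigma * math.sqrt(2)))

# Poisson approximation: defect more than 3 (0-2 deaths), excess more than 3 (10 or more)
defect = sum(lam**k * math.exp(-lam) / math.factorial(k) for k in range(3))
excess = 1 - sum(lam**k * math.exp(-lam) / math.factorial(k) for k in range(10))

print(round(binom, 3), round(normal, 3))
print(round(defect, 5), round(excess, 5))
```

The binomial sum gives .854, the normal approximation .847, and the Poisson tails 6.197 and 8.392 per cent, matching the figures quoted above.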


15. Introduction. In Chapter I we have discussed very briefly three different methods of describing frequency distributions of one variable — the purely graphic method, the method of averages and measures of dispersion, and the method of theoretical frequency functions or curves. The weakness and inadequacy of the purely graphic method lies in the fact that it fails to give a numerical description of the distribution. While the method of averages and measures of dispersion gives a numerical description in the form of a summary characterization which is likely to be useful for many statistical purposes, particularly for purposes of comparison, the method is inadequate for some purposes because (1) it does not give a characterization of the distribution in the neighborhood of each point x or in each small interval x to x + dx of the variable, and (2) it does not give a functional relation between the values of the variable x and the corresponding frequencies.
To give a description of the distribution at each small interval x to x + dx and to give a functional relation between the variable x and the frequency or probability, we require a third method, which may be described as the "analytical method of describing frequency distributions." This method uses theoretical frequency functions. That is, in this method of description we attempt to characterize the given observed frequency distribution by appealing to underlying probabilities, and we seek a frequency function y = F(x) such that F(x)dx gives, to within infinitesimals of higher order, the probability that a variate x′ taken at random falls in the interval x to x + dx.

Although the great bulk of frequency distributions 
which occur so abundantly in practical statistics have cer- 
tain important properties in common, nevertheless they 
vary sufficiently to present difficult problems in consider- 
ing the properties of F(x) which should be regarded as 
fundamental in the selection of an appropriate function 
to fit a given observed distribution. 

The most prominent frequency function of practical statistics is the normal or so-called Gaussian function

(1)  y = [1/(σ√(2π))] e^{-x²/(2σ²)} ,

where σ is the standard deviation (see Fig. 3, p. 35).

Although Gauss made such noteworthy contributions to error theory by the use of this function that his name is very commonly attached to the function, and to the corresponding curve, it is well known that Laplace made use of the exponential frequency function prior to Gauss by at least thirty years. It would thus appear that the name of Laplace might more appropriately be attached to the function than that of Gauss. But in a recent and very interesting historical note, Karl Pearson* finds that De Moivre as early as 1733 gave a treatment of the probability integral and of the normal frequency function. The work of De Moivre antedates the discussion of Laplace by nearly a half-century. Moreover, De Moivre's


treatment is essentially our modern treatment. Hence it appears that the discovery of the normal function should be attributed to De Moivre, and that his name might be most appropriately attached to the function. It may well be recalled that we obtained this function (1) in the De Moivre-Laplace theory (p. 34). In (1) the origin is taken so that the x-co-ordinate of the centroid of area under the curve is zero. The approximate value of the centroid may be obtained from a large number of observed variates by finding their arithmetic mean. The σ is equal to the radius of gyration of the area under the curve with respect to the y-axis, and is obtained approximately from observed variates by finding their standard deviation. The probability or frequency function (1) has been derived from a great variety of hypotheses.¹² The difficulty is not one of deriving this function but rather one of establishing a high degree of probability that the hypotheses underlying the derivation are realized in relation to practical problems of statistics.

In the decade from 1890 to 1900, it became well estab- 
lished experimentally that the normal probability func- 
tion is inadequate to represent many frequency distribu- 
tions which arise in biological data. To meet the situation 
it was clearly desirable either to devise methods for char- 
acterizing the most conspicuous departures from the 
normal distributions or to develop generalized frequency 
curves. The description and characterization of these de- 
partures without the direct use of generalized frequency 
curves has been accomplished roughly by the introduction 
(see pp. 68-72) of measures of skewness and of peakedness 
(excess or kurtosis), but the rationale underlying such 
measures is surely to be sought most naturally in the 


properties of generalized frequency functions. In spite of 
the reasons which may thus be advanced for the study of 
generalized frequency curves, it is fairly obvious that, for 
the most part, the authors of the rather large number of 
recent elementary textbooks on the methods of statistical 
analysis seem to regard it as undesirable or impracticable 
to include in such books the theory of generalized fre- 
quency curves. The writer is inclined to agree with these 
authors in the view that the complications of a theory of 
generalized frequency curves would perhaps have carried 
them too far from their main purposes. Nevertheless, 
some results of this theory are important for elementary 
statistics in providing a set of norms for the description 
of actual frequency distributions. In order to avoid mis- 
understanding it should perhaps be said that it is not 
intended to imply that a formal mathematical representa- 
tion of many numerical distributions is desirable, but 
rather that a certain amount of such representation of 
carefully selected distributions should be encouraged. A 
useful purpose will be served in this connection if we can 
make certain points of interest in the theory more accessi- 
ble by means of the present monograph. 

The problem of developing generalized frequency 
curves has been attacked from several different directions. 
Gram (1879), Thiele (1889), and Charlier (1905) in Scan- 
dinavian countries; Pearson (1895) and Edgeworth 
(1896) in England; and Fechner (1897) and Bruns (1897)
in Germany have developed theories of generalized fre- 
quency curves from viewpoints which give very different 
degrees of prominence to the normal probability curve in 
the development of a more general theory. In the present 
monograph, special attention will be given to two systems 


of frequency curves — the Pearson system and the Gram- 
Charlier system. 

16. The Pearson system of generalized frequency curves. Pearson's first memoir¹³ dealing with generalized frequency curves appeared in 1895. In this paper he gave four types of frequency curves in addition to the normal curve, with three subtypes under his Type I and two subtypes under his Type III. He published a supplementary memoir¹⁴ in 1901 which presented two further types. A second supplementary memoir¹⁵ which was published in 1916 gave five additional types. Pearson's curves, which are widely different in general appearance, are so well known and so accessible that we shall take no time to comment on them as graduation curves for a great variety of frequency distributions, but we shall attempt to indicate the genesis of the curves with special reference to the methods by which they are grounded on or associated with underlying probabilities.

We shall consider a frequency function y = F(x) of one variable where we assume that F(x)dx differs at most by an infinitesimal of higher order from the probability that a variate x taken at random will fall into the interval x to x + dx. Pearson's types of curves y = F(x) are obtained by integration of the differential equation

(2)  dy/dx = (x + a)y / (c₀ + c₁x + c₂x²) ,

and by giving attention to the interval on x in which y = F(x) is positive. The normal curve is given by the special case c₁ = c₂ = 0. We may easily obtain a clear view of the genesis of the system of Pearson curves in relation to


laws of probability by following the early steps in the development of equation (2). The development is started by representing the probabilities of successes in n trials given by the terms of the symmetric point binomial (1/2 + 1/2)^n as ordinates of a frequency polygon. It is then easily proved that the slope dy/dx of any side of this polygon, at its midpoint, takes the form

(3)  dy/dx = -k(x + a)y ,

where y is the ordinate at this point, and a and k are constants. By integration, we obtain the curve for which this differential equation is true at all points. The curve thus obtained is the normal curve (Pearson's Type VII).
The next step consists in dealing with the asymmetric point binomial (p + q)^n, p ≠ q, in a manner analogous to that used in the case of the symmetric point binomial. This procedure gives the differential equation

dy/dx = (x + a)y / (c₀ + c₁x) ,

from which we obtain by integration the Pearson Type III curve

(4)  y = y₀ (1 + x/a)^{γa} e^{-γx} .

That is, with respect to the slope property, this curve stands in the same relation to the values given by the asymmetric binomial polygon as the normal curve does to values given by the symmetric binomial.

Thus far the underlying probability of success has been assumed constant. The next step consists in taking up a probability problem in which the chance of success is not constant, but depends upon what has happened previously in a set of trials. Thus, the chance of getting r white balls from a bag containing np white and nq black balls in drawing s balls one at a time without replacements is given by

(5)  [s!/(r!(s - r)!)] np! nq! / [(np - r)! (nq - s + r)! (n)_s] ,

where (n)_s means the number of permutations of n things taken s at a time and s!/(r!(s - r)!) is the number of combinations of s things r at a time. This expression is a term of a hypergeometric series. By representing the terms of this series as ordinates of a frequency polygon, finding the slope of a side of the frequency polygon, and proceeding in a manner analogous to that used in the case of the point binomial, we obtain a differential equation of the form given in (2). Thus, we make r = 0, 1, 2, . . . . , s and obtain the s + 1 ordinates y₀, y₁, y₂, . . . . , y_s at unit intervals. At the middle point of the side joining the tops of ordinates y_r and y_{r+1}, we have

(6)  x = r + 1/2 ,   y = (y_r + y_{r+1})/2 ,




and

(7)  dy/dx = y_{r+1} - y_r = y_r [s + nps - nq - 1 - r(n + 2)] / [(r + 1)(r + 1 + nq - s)] .

From y = (y_r + y_{r+1})/2, we have

(8)  y = y_r [nps + nq + 1 - s + r(nq + 2 - np - 2s) + 2r²] / [2(r + 1)(r + 1 + nq - s)] .

From (7) and (8), replacing r by x - 1/2, we have

(9)  (1/y)(dy/dx) = [2s + 2nps - 2nq - 2 - (2x - 1)(n + 2)] / [nps + nq + 1 - s + (x - 1/2)(nq + 2 - np - 2s) + 2(x - 1/2)²] .

From (9), we observe that the slope of the frequency 
polygon, at the middle point of any side, divided by the 
ordinate at that point is equal to a fraction whose numer- 
ator is a linear function of x and whose denominator is a 
quadratic function of x. 

The differential equation (2) gives a general statement 
of this property. It is more general than (9) in that the 
constants of (9) are special values found from the law of 
probability involved in drawings from a limited supply 
without replacements. One of Pearson's generalizations 
therefore consists in admitting as frequency curves all 
those curves of which (2) is the differential equation with- 
out the limitations on the values of the constants involved 
in (9). 
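The slope property expressed by (9) can be verified numerically. The following Python sketch (not part of the text; the values n = 20, p = .4, s = 6 are illustrative) computes the hypergeometric ordinates and checks the midpoint slope ratio against the right-hand side of (9):

```python
from math import comb

def hypergeom_ordinates(n, p, s):
    """Ordinates y_r: probability of r white balls in s draws without
    replacement from np white and nq black, r = 0, 1, ..., s."""
    np_, nq = round(n * p), round(n * (1 - p))
    return [comb(np_, r) * comb(nq, s - r) / comb(n, s) for r in range(s + 1)]

def slope_ratio_formula(n, p, s, r):
    """Right-hand side of (9) at x = r + 1/2."""
    np_, nq = n * p, n * (1 - p)
    num = 2 * (s + np_ * s - nq - 1 - r * (n + 2))
    den = np_ * s + nq + 1 - s + r * (nq + 2 - np_ - 2 * s) + 2 * r * r
    return num / den

n, p, s = 20, 0.4, 6
y = hypergeom_ordinates(n, p, s)
for r in range(s):
    # slope of a side divided by the ordinate at its middle point
    midpoint_ratio = (y[r + 1] - y[r]) / ((y[r] + y[r + 1]) / 2)
    assert abs(midpoint_ratio - slope_ratio_formula(n, p, s, r)) < 1e-12
print("slope relation (9) verified")
```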

The questions involved in the integration of (2) and 
in the determination of parameters for actual distribu- 
tions are so available in Elderton's Frequency Curves and 
Correlation, and elsewhere, that it seems undesirable to 
take the space necessary to deal with these questions 
here. The resulting types of equations and figures that 
indicate the general form of the curves for certain positive 
values of the parameters are listed below. 


Type I (Fig. 5)

y = y₀ (1 + x/a₁)^{m₁} (1 - x/a₂)^{m₂} ,  limited range -a₁ to a₂ .

Type II (Fig. 6)

y = y₀ (1 - x²/a²)^m .

Type III (Fig. 7)

y = y₀ (1 + x/a)^{γa} e^{-γx} .

Type IV

y = y₀ (1 + x²/a²)^{-m} e^{-v arctan(x/a)} .

A skew curve of unlimited range at both ends, roughly described in general appearance as a slightly deformed normal curve (for the normal curve, see Fig. 3, p. 35).

Type V (Fig. 8)

y = y₀ x^{-p} e^{-γ/x} .

Type VI (Fig. 9)

y = y₀ (x - a)^{q₂} x^{-q₁} .

Type VII (Fig. 3, p. 35)

y = y₀ e^{-x²/(2σ²)} .

The normal frequency curve.

Type VIII (Fig. 10)

y = y₀ (1 + x/a)^{-m} .

This type degenerates into an equilateral hyperbola when m = 1.

Type IX (Fig. 11)

y = y₀ (1 + x/a)^m .

This type degenerates into a straight line when m = 1.

Type X (Fig. 12)

y = [1/(2σ)] e^{-|x|/σ} .

This type is Laplace's first frequency curve, while the normal curve is sometimes called his second frequency curve. The curve is shown for both negative and positive values of x/σ.

Type XI (Fig. 13)

y = y₀ x^{-m} .

Type XII (Fig. 14)

y = y₀ [(a₁ + x)/(a₂ - x)]^m .

The above figures should be regarded as illustrating only in a meager way, for particular positive values of the parameters, the variety of shapes that are assumed by the Pearson type curves. For example, it is fairly obvious that Types I and II would be U-shaped when the exponents are negative, and that Type III would be J-shaped if γa were negative.

The idea of obtaining a suitable basis for frequency curves in the probabilities given by terms of a hypergeometric series is the main principle which supports the Pearson curves as probability or frequency curves, rather than as mere graduation curves. That is to say, these curves should have a wide range of applications as probability or frequency curves if the distribution of statistical material may be likened to distributions which arise under the law of probability represented by terms of a hypergeometric series, and if this law may be well expressed by determining a frequency function y = F(x) from the slope of the frequency polygon of the hypergeometric series. In examining the source of the Pearson curves, the fact should not be overlooked that the normal probability curve can be derived from hypotheses containing much broader implications than are involved in a slope condition on the side of a symmetric binomial polygon.


The method of moments plays an essential rôle in the
Pearson system of frequency curves, not only in the de- 
termination of the parameters, but also in providing 
criteria for selecting the appropriate type of curve. Pear- 
son has attempted to provide a set of curves such that 
some one of the set would agree with any observational 
or theoretical frequency curve of positive ordinates by 
having equal areas and equal first, second, third, and 
fourth moments of area about a centroidal axis. 

Let μ_m be the mth moment coefficient about a centroid vertical taken as the y-axis (cf. p. 19). That is, let

(10)  μ_m = ∫_{-∞}^{+∞} x^m F(x) dx ,

where F(x) is the frequency function (see p. 13). Next, let

β₁ = μ₃²/μ₂³ ,   β₂ = μ₄/μ₂² .

Then it is Pearson's thesis that the conditions μ₀ = 1, μ₁ = 0, together with the equality of the numbers μ₂, β₁, and β₂ for the observed and theoretical curves, lead to equations whose solutions give such values to the parameters of the frequency function that we almost invariably obtain excellency of fit by using the appropriate one of the curves of his system to fit the data, and that badness of fit can be traced, in general, to heterogeneity of data, or to the difficulty in the determination of moments from the data as in the case of J- and U-shaped curves.


Let us next examine the nature of the criteria by which to pass judgment on the type of curve to use in any numerical case. Obviously, the form which the integral y = F(x) obtained from (2) takes depends on the nature of the zeros of the quadratic function in the denominator. An examination of the discriminant of this quadratic function leads to equalities and inequalities involving β₁ and β₂ which serve as criteria in the selection of the type of function to be used. A systematic procedure for applying these criteria has been thoroughly developed and published in convenient form in Pearson's Tables for Statisticians and Biometricians (1914), pages lx-lxx and 66-67; and in his paper in the Philosophical Transactions, A, Volume 216 (1916), pages 429-57. The relations between β₁ and β₂ may be conveniently represented by curves in the β₁β₂-plane. Then the normal curve corresponds to the point β₁ = 0, β₂ = 3 in this plane. Type III is to be chosen when the point (β₁, β₂) is on the line 2β₂ - 3β₁ - 6 = 0; and Type V, when (β₁, β₂) is on the curve

β₁(β₂ + 3)² = 4(4β₂ - 3β₁)(2β₂ - 3β₁ - 6) .

In considering subtypes under Type I, a biquadratic in β₁ and β₂ separates the area of J-shaped modeless curves from the area of limited-range modal curves and the area of U-shaped curves.
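These criteria are easily evaluated numerically. A minimal Python sketch (the function name and the illustrative moment values are not from the text) computes β₁ and β₂ from the central moments, together with the two criterion quantities just stated:

```python
def beta_criteria(moments):
    """Given central moments (mu2, mu3, mu4), return Pearson's beta1 and beta2
    and the Type III and Type V criterion quantities (each zero on its locus)."""
    mu2, mu3, mu4 = moments
    b1 = mu3**2 / mu2**3
    b2 = mu4 / mu2**2
    type3 = 2 * b2 - 3 * b1 - 6          # zero on the Type III line
    type5 = b1 * (b2 + 3)**2 - 4 * (4 * b2 - 3 * b1) * (2 * b2 - 3 * b1 - 6)
    return b1, b2, type3, type5

# The normal curve: mu3 = 0, mu4 = 3 sigma^4, so beta1 = 0 and beta2 = 3;
# the point (0, 3) lies on both loci, as both criteria vanish there.
b1, b2, t3, t5 = beta_criteria((1.0, 0.0, 3.0))
print(b1, b2, t3, t5)   # -> 0.0 3.0 0.0 0.0
```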

Without going further into detail about criteria for the selection of the type of curve, we may summarize by saying that curves traced on the β₁β₂-plane provide the means of selecting the Pearson type of frequency curve appropriate to the given distribution in so far as the necessary conditions expressed by relations between β₁ and β₂ turn out to be sufficient to determine a suitable type of curve.
The difficulties involved in the numerical computation 
of the parameters of the Pearson curves were rather clear- 
ly indicated in Pearson's original papers. The appropriate 
tables and forms for computations in fitting the curves to 
numerical distributions have been so available in various 
books as to facilitate greatly the applications to concrete 
data. Among such books and tables, special mention 
should be made of Frequency Curves and Correlation 
(1906), by W. P. Elderton, pages 5-105; Tables for Statis- 
ticians and Biometricians (1924), by Karl Pearson; and 
Tables of Incomplete Gamma Functions (1921), by the 
same author. 

17. Generalized normal curves — Gram-Charlier series. Suppose some simple frequency function such as the normal function or the Poisson exponential function (p. 41) gives a rough approximation to a given frequency distribution, and that we desire a more accurate analytic representation than would be given by the simple frequency function. In this situation, it seems natural to seek an analytical representation by means of the first few terms of a rapidly convergent series of which the first term, called the "generating function," is the simple frequency function which gives the rough approximation.

Prominent among the contributors to the method of the representation of frequency by a series may be named Gram,¹⁶ Thiele,¹⁷ Edgeworth,¹⁸ Fechner,¹⁹ Bruns,²⁰ Charlier,²¹ and Romanovsky.²²

Our consideration of series for the representation of 
frequency will be limited almost entirely to the Gram- 
Charlier generalizations of the normal frequency function 


and of the Poisson exponential function, by using these 
functions as generating functions. These two types of 
series may be written in the following forms: 

Type A

(11)  F(x) = a₀φ(x) + a₃φ^{(3)}(x) + .... + a_nφ^{(n)}(x) + .... ,

where

φ(x) = [1/√(2π)] e^{-x²/2} ,

and φ^{(n)}(x) is the nth derivative of φ(x) with respect to x.

Type B

(12)  F(x) = c₀ψ(x) + c₁Δψ(x) + .... + c_nΔⁿψ(x) + .... ,

where

ψ(x) = (e^{-λ} sin πx / π) [1/x - λ/(1!(x - 1)) + λ²/(2!(x - 2)) - ....] ,

which is the Poisson exponential e^{-λ}λ^x/x! for non-negative integral values of x, and where Δψ(x), Δ²ψ(x), . . . . , denote the successive finite differences of ψ(x) beginning with the first difference.

If Type A or Type B converges so rapidly that terms 
after the second or third may be neglected, it is fairly 


obvious that we have a simple analytic representation of 
the distribution. 

The general appearance of the curves represented by two or three terms of Type A, for particular values of the coefficients, is shown in Figure 15 so as to facilitate comparison with the corresponding normal curve represented by the first term.

[Fig. 15:
I. y = φ(x), the normal curve
II. y = φ(x) + a₃φ^{(3)}(x)
III. y = φ(x) + a₃φ^{(3)}(x) + a₄φ^{(4)}(x)]

A general notion of the values of the function represented by the first term of Type B may be obtained for particular values of λ from Figure 4, page 43. When λ is taken equal to the arithmetic mean of the number of occurrences of the rare event in question, we shall find that c₁ = 0. We may then well inquire into the general appearance of the graph of the function

y = ψ(x) + c₂Δ²ψ(x)

for particular values of c₂ and λ. For λ = 2 and c₂ = -.4, see Figure 16, which shows also the corresponding ψ(x).

[Fig. 16:
I. y = ψ(x)
II. y = ψ(x) - .4Δ²ψ(x)]

It should probably be emphasized that the usefulness 
of a series representation of a given frequency distribu- 
tion depends largely upon the rapidity of convergence. 
In turn the rapidity with which the series converges de- 
pends much upon the degree of approach of the generat- 
ing function to the given distribution. 

Although it is known²³ that the Type A series is capable of converging to an arbitrary function f(x) subject to certain conditions of continuity and vanishing at infinity, mere convergence is not sufficient for our problems. The representation of an actual frequency distribution requires, in general, such rapid convergence that only a few terms will be found necessary for the desired degree of approximation, because (1) the amount of labor in computation soon becomes impracticable as the number of terms increases, and (2) the probable errors of the high-order moments involved in finding the parameters would generally be so large that the assumption that we may use moments of observations for the theoretical moments will become untenable.

18. Remarks on the genesis of the Type A and Type B forms. We naturally ask why a generalization of the normal frequency function should take the form of Type A rather than some other form, say the product of the generating function by a simple polynomial of low degree in x or by an ordinary power series in x. A similar question might be asked about the generalization of the Poisson exponential function. There seems to be no very simple answer to these questions. It is fair to say that algebraic and numerical convenience, as well as suggestions from underlying probability theory, have been significant factors in the selection of Type A and Type B. The algebraic and numerical convenience of Type A becomes fairly obvious by following Gram in determining the parameters. The suggestion of these forms in probability theory is closely associated with the development of the hypothesis of elementary errors (deviations) as given by Charlier.²¹ A very readable discussion of the manner in which the Type A series arises in the probability theory of the distribution of a variate built up by the summation of a large number of independent elements is given in the recent book by Whittaker and Robinson on The Calculus of Observations, pages 168-74.

In the present monograph, we shall limit our discussion of the probability theory underlying Types A and B to showing in Chapter VII that a certain line of development of the binomial distribution suggests the use of the Type A series as an extension of the ordinary De Moivre-Laplace approximation, and the Type B series as an extension of the Poisson exponential approximation considered in Chapter II. This development is postponed to the final chapter of the book because it involves more formal mathematics than some readers may find it convenient to follow. Certain important results derived in Chapter VII are stated without proof in §§ 19-21. While a mastery of the details of Chapter VII is not essential to an understanding of the results given in §§ 19-21, the reader who can follow a formal mathematical development without special difficulty may well read Chapter VII at this point instead of reading §§ 19-21. In § 56 of Chapter VII we follow closely the recent work of Wicksell²⁴ in the development of the forms of the Type A and Type B series. Then in §§ 57-59 we deal with the principles involved in the determination of the parameters in these type forms.

19. The coefficients of the Type A series expressed in moments of the observed distribution. If we measure x from the centroid of area as an origin and with units equal to the standard deviation, σ, we may write the Type A series in the form

F(x) = φ(x) + a₃φ^{(3)}(x) + a₄φ^{(4)}(x) + .... + a_nφ^{(n)}(x) + .... ,

where

φ(x) = [1/√(2π)] e^{-x²/2} ,

and φ^{(n)}(x) is the nth derivative of φ(x) with respect to x.


It will be shown in § 57 that the coefficients a_n for (n = 3, 4, . . . .) may then be expressed in the form

(14)  a_n = [(-1)ⁿ/n!] ∫_{-∞}^{+∞} F(x) H_n(x) dx ,

where

H_n(x) = xⁿ - [n(n - 1)/2] x^{n-2} + [n(n - 1)(n - 2)(n - 3)/(2·4)] x^{n-4} - ....

is a so-called Hermite polynomial.
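The polynomials H_n(x) may be generated by the recurrence H_{n+1}(x) = x H_n(x) - n H_{n-1}(x), a standard property assumed here rather than taken from the text. A short Python sketch confirms that the recurrence reproduces the explicit formula above for n = 3 and n = 4:

```python
def hermite(n):
    """Coefficient list (constant term first) of the Hermite polynomial H_n(x),
    built from the recurrence H_{n+1}(x) = x H_n(x) - n H_{n-1}(x)."""
    h0, h1 = [1.0], [0.0, 1.0]          # H_0 = 1 and H_1 = x
    if n == 0:
        return h0
    for k in range(1, n):
        nxt = [0.0] + h1                # multiply h1 by x
        for i, c in enumerate(h0):      # subtract k * h0
            nxt[i] -= k * c
        h0, h1 = h1, nxt
    return h1

# H_3(x) = x^3 - 3x and H_4(x) = x^4 - 6x^2 + 3, in agreement with
# x^n - [n(n-1)/2] x^(n-2) + [n(n-1)(n-2)(n-3)/(2*4)] x^(n-4) - ...
print(hermite(3), hermite(4))
```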

To determine a_n numerically, we replace F(x) in (14) by the corresponding observed frequency function f(x), and replace x by x/σ if we measure x with ordinary units (feet, pounds, etc.) instead of using the standard deviation as the unit. Then we may write

(15)  a_n = [(-1)ⁿ/n!] ∫_{-∞}^{+∞} f(x) H_n(x/σ) d(x/σ) .

Inserting the values of H_n(x/σ) for n = 3, 4, 5 in (15), we obtain the coefficients in terms of moments as follows, using the symbol α_i for the quotient μ_i/σ^i:

a₃ = -μ₃/(3! σ³) = -α₃/3! ,

a₄ = (μ₄ - 3σ⁴)/(4! σ⁴) = (α₄ - 3)/4! ,

a₅ = -(μ₅ - 10μ₃σ²)/(5! σ⁵) = -(α₅ - 10α₃)/5! .

20. Remarks on two methods of determining the coefficients of the Type A series. It will be shown in § 57 that formula (14) for any coefficient a_n of the Type A series may be derived by making use of the fact that φ^{(n)}(x) and the Hermite polynomials H_n(x) form a biorthogonal system. Then, as indicated on page 168, we obtain a_n in terms of moments of the observed distribution.

As a second method of obtaining a_n in terms of the moments of the observed distribution f(x), it will be shown in § 58 that the values of the coefficients given in § 19 may be derived by imposing the least-squares criterion that

(16)  ∫_{-∞}^{+∞} [1/φ(x)] [f(x) - F(x)]² dx

shall be a minimum.

21. The coefficients of the Type B series. For the 
Type B series (12), we shall for simplicity limit the deter- 
mination of coefficients to the first three terms. More- 
over, we shall restrict our treatment to a distribution of 
equally distant ordinates at non-negative integral values 
of x. Then the problem is to find the coefficients $c_0$, $c_1$, $c_2$ in 

$$F(x) = c_0\psi(x) + c_1\Delta\psi(x) + c_2\Delta^2\psi(x)\,,$$

for $x = 0, 1, 2, \ldots$. 

By expressing the coefficients in terms of moments of 
the observed distribution as shown in § 59, we find 

$$c_0 = 1\,, \qquad c_1 = 0\,, \qquad c_2 = \tfrac{1}{2}(\mu_2 - \lambda)\,,$$

when $\lambda$ is taken equal to the arithmetic mean of the given 
observed values. 
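A short sketch of this moment fit follows, under the assumption that the coefficients are $c_0 = 1$, $c_1 = 0$, $c_2 = (\mu_2 - \lambda)/2$ with $\lambda$ the arithmetic mean (the standard Charlier Type B result); the function name is ours:

```python
import math

def type_b_parameters(freqs):
    """Parameters of the first three Type B terms for a distribution
    over x = 0, 1, 2, ... (freqs[x] = frequency of x), assuming the
    moment fit with lambda = arithmetic mean: c0 = 1, c1 = 0,
    c2 = (mu2 - lambda)/2."""
    total = sum(freqs)
    mean = sum(x * f for x, f in enumerate(freqs)) / total
    mu2 = sum(f * (x - mean) ** 2 for x, f in enumerate(freqs)) / total
    lam = mean
    return lam, 1.0, 0.0, (mu2 - lam) / 2.0

# For exactly Poisson probabilities mu2 = lambda, so c2 vanishes:
pois = [math.exp(-2.0) * 2.0 ** x / math.factorial(x) for x in range(40)]
lam, c0, c1, c2 = type_b_parameters(pois)
```

The vanishing of $c_2$ for a Poisson distribution shows that the higher terms of the series measure the departure of the data from the generating function $\psi(x)$.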

22. Remarks. With respect to the selection of Type 
A or Type B of Charlier to represent given numerical 
data, no criterion corresponding to the Pearson criteria 
has been given which enables one to distinguish between 
cases in which to apply one of these types in preference 
to the other, but Type B applies, in general, to certain 
decidedly skew distributions; and, in particular, to dis- 
tributions of variates having a natural lower or upper 
bound with the modal frequency much nearer to such 
natural bound than to the other end of the distribution. 
For example, a frequency distribution of the number dy- 
ing per month in a city from a minor disease would have 
the modal value near zero, the natural lower bound. 

While the systematic procedure in fitting Charlier 
curves to data is not so well standardized as the methods 
used in fitting curves of the Pearson system to data, 
tables of $\phi(t)$, where t is in units of standard deviation, of 
its integral from 0 to t, and of its second to eighth deriva- 
tives are given to five decimal places for the range t = 0 
to t = 5 at intervals of .01 by James W. Glover,^ and tables 
of the function, its integral, and first six derivatives are 
given by N. R. Jorgensen^ to seven decimal places for 
t = 0 to t = 4. 

23. Skewness. Charlier has fittingly called the coeffi- 
cients $a_3, a_4, a_5, \ldots$, along with the mean and standard 
deviation, the characteristics of the distribution. The co- 
efficients $a_3$ and $a_4$ may be interpreted so as to give charac- 
teristics which appear very significant in a description of 
a distribution to a general reader with little or no mathe- 
matical training. It is the common experience of those 
who have dealt with actual distributions of practical sta- 
tistics that many of the distributions are not symmetrical. 
A measure is needed to indicate the degree of asymmetry 
or skewness of distributions in order that we may de- 
scribe and compare the degrees of skewness of different 
distributions. 

A measure of skewness is given by 

(17) $S = -3a_3 = \dfrac{\mu_3}{2\sigma^3} = \tfrac{1}{2}\alpha_3\,.$

Another measure of skewness is 

(18) $S = \dfrac{\text{Mean} - \text{Mode}}{\sigma}\,.$

In this latter measure we have adopted a convention as 
to sign by which the skewness is positive when the mean 
is greater than the mode. Some authors define skewness 
as equal numerically but opposite in sign to the value in 
our definition. 
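Measure (17) is immediately computable from sample moments. A minimal sketch (the function name is ours):

```python
def skewness_S(data):
    """Charlier's measure (17): S = mu3 / (2 sigma^3) = alpha3 / 2."""
    n = len(data)
    m = sum(data) / n
    mu2 = sum((x - m) ** 2 for x in data) / n
    mu3 = sum((x - m) ** 3 for x in data) / n
    return mu3 / (2 * mu2 ** 1.5)

# A sample stretched to the right has positive skewness,
# while a symmetric sample has S = 0:
s_pos = skewness_S([1, 2, 3, 4, 10])
s_sym = skewness_S([1, 2, 3])
```

The sign of S then follows the convention adopted above: positive when the long tail (and hence the mean) lies to the right of the mode.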

We may easily prove that the measures (17) and (18) 
are equal for a distribution given by the Pearson Type 
III curve, and approximately equal for a distribution giv- 
en by the first two terms of the Gram-Charlier Type A 
when S as defined in (17) is not very large. 

For the Pearson Type III (p. 54), 

$$\frac{dy}{dx} = -\frac{(x+a)\,y}{c_0 + c_1 x}\,.$$

When the parameters in this equation are expressed in 
moments about the mean, the equation takes the form 

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + \mu_3/2\sigma^2}{\mu_2 + \mu_3 x/2\sigma^2}\,,$$

if the origin is at the mean of the distribution. The mode 
is the value of x for which $dy/dx = 0$, namely 

$$x = -\frac{\mu_3}{2\sigma^2}\,.$$

That is, 

$$\frac{\text{Mean} - \text{Mode}}{\sigma} = \frac{\mu_3}{2\sigma^3}\,.$$

Hence the measures (17) and (18) are equal for the Type 
III distribution. 

For a distribution given by the first two terms of Type 
A, we are to consider the frequency curve 

(19) $y = \dfrac{1}{\sigma}\left[\phi\!\left(\dfrac{x}{\sigma}\right) + a_3\,\phi^{(3)}\!\left(\dfrac{x}{\sigma}\right)\right], \qquad a_3 = -\tfrac{1}{3}S\,.$

We shall now prove that the distance from the mean 
(origin) to the mode is approximately $-\sigma S$ when S is 
fairly small. 

We have from (19) 

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x}{\sigma^2} - \frac{S}{\sigma} + \frac{Sx^2}{\sigma^3}\,,$$

if we neglect terms in $S^2$. Then 

$$-\frac{x}{\sigma^2} - \frac{S}{\sigma} + \frac{Sx^2}{\sigma^3} = 0$$

at the mode. Solving the quadratic for x we obtain 
$x = -\sigma S$ if we neglect terms of the order $S^2$. Hence, the 
measures (17) and (18) are approximately equal for a dis- 
tribution given by the first two terms of a Gram-Charlier 
Type A series. 
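The approximation can be checked numerically. The sketch below works in standard units ($\sigma = 1$), builds the two-term curve with $a_3 = -S/3$, and locates its mode on a fine grid; the names and the grid search are ours:

```python
import math

def type_a_two_term(t, S):
    """First two terms of Type A in standard units:
    y = phi(t) + a3 * phi'''(t), with a3 = -S/3 and
    phi'''(t) = -(t**3 - 3*t) * phi(t)  (Hermite H3)."""
    phi = math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
    return phi * (1 + (S / 3) * (t ** 3 - 3 * t))

S = 0.1
grid = [i / 10000.0 for i in range(-20000, 20001)]
mode = max(grid, key=lambda t: type_a_two_term(t, S))
# mode should be close to -S = -0.1, with an error of order S**2
```

With S = 0.1 the grid search gives a mode near −0.098, within the stated order-$S^2$ error of the approximation $-\sigma S$.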

24. Excess. In the general description of a given fre- 
quency distribution, we may add an important feature to 
the description by considering the relative number of 
variates in the immediate neighborhood of some central 
value such as the mean or the mode. That is, it would 
add to the description to give a measure of the degree of 
peakedness of a frequency curve fitted to a distribution 
by comparison with the corresponding normal curve 
fitted to the same distribution. The measure of the peak- 
edness to which we shall now give attention is sometimes 
called the excess and sometimes the measure of kurtosis. 

The excess or degree of kurtosis is measured by 

$$E = 3a_4 = \frac{1}{8}\left(\frac{\mu_4}{\sigma^4} - 3\right) = \frac{1}{8}\,(\alpha_4 - 3)\,.$$

If the excess is positive, the number of variates in the 
neighborhood of the mean is greater than in a normal dis- 
tribution. That is, the frequency curve is higher or more 
peaked in the neighborhood of the mean than the corre- 
sponding normal curve with the same standard deviation. 
On the other hand, if the excess is negative, the curve is 
more flat-topped than the corresponding normal curve. 
To obtain a clearer insight into the relation of the measure 
of excess to the theoretical representation of frequency, 
let us consider a Gram-Charlier series of Type A to three 
terms, 

(20) $y = \phi(x) + a_3\phi^{(3)}(x) + a_4\phi^{(4)}(x) = \dfrac{1}{\sigma(2\pi)^{1/2}}\,e^{-x^2/2\sigma^2}\left[1 - a_3H_3\!\left(\dfrac{x}{\sigma}\right) + a_4H_4\!\left(\dfrac{x}{\sigma}\right)\right].$

When we compare the ordinate of (20) at the mean 
x = 0 with the ordinate $1/\sigma(2\pi)^{1/2}$ at the mean for the 
normal curve, we observe that this ordinate exceeds the 
corresponding ordinate of the normal curve by $E/\sigma(2\pi)^{1/2}$. 
That is, the excess E is equal to the coefficient by which 
to multiply the ordinate at the centroid of the normal 
curve to get the increment to this ordinate as calculated 
by retaining the terms in $\phi^{(3)}(x)$ and $\phi^{(4)}(x)$ of the Type 
A series. 
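Both the computation of E from moments and the central-ordinate relation can be sketched in a few lines (the function name is ours; we work with $\sigma = 1$ and use $H_3(0) = 0$, $H_4(0) = 3$):

```python
import math

def excess(data):
    """E = (1/8)(mu4/sigma^4 - 3); positive for curves more peaked
    than the normal, negative for flatter-topped curves."""
    n = len(data)
    m = sum(data) / n
    mu2 = sum((x - m) ** 2 for x in data) / n
    mu4 = sum((x - m) ** 4 for x in data) / n
    return (mu4 / mu2 ** 2 - 3) / 8

# At the mean (x = 0) the three-term Type A ordinate is
# phi(0) * (1 + 3*a4) = phi(0) * (1 + E), since H3(0) = 0, H4(0) = 3:
E = 0.05
a4 = E / 3
phi0 = 1 / math.sqrt(2 * math.pi)
ordinate = phi0 * (1 + a4 * 3)   # the a3 term drops because H3(0) = 0
```

The identity `ordinate == phi0 * (1 + E)` holds exactly, which is the statement that the excess is the fractional increment of the central ordinate over the normal curve.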

25. Remarks on the distribution of certain trans- 
formed variates. Underlying our discussion of frequency 
functions, there has perhaps been an implication that 


the various types of distribution could be accounted for 
by an appropriate theory of probability. There may, how- 
ever, be other than chance factors that produce significant 
effects on the type of the distribution. Such effects may 
in certain cases be traced to their source by regarding the 
variates of a distribution as the results of transformations 
of the variates of some other type of distribution. Edge- 
worth was prominent in thus regarding certain distribu- 
tions. For simple examples, we may think of the diame- 
ters, surfaces, and volumes of spheres that represent ob- 
jects in nature, such as oranges on a tree or peas on a 
plant. Suppose the distribution of diameters is a normal 
distribution. It seems natural to inquire into the nature 
of the distribution of the corresponding surfaces and 
volumes. The partial answer to the inquiry is that these 
are distributions of positive skewness. The same kind of 
problem would arise if we knew that velocities, v, of mole- 
cules of gas were normally distributed, and were required 
to investigate the distribution of energies $mv^2/2$. 

To illustrate somewhat more concretely with actual 
data, it may be observed in looking over the frequency 
distributions of the various subgroups on build of men, 
in Volume I of the Medico-Actuarial Mortality Investiga- 
tion, that the distributions with respect to weight are, in 
general, not so nearly symmetrical as the distributions as 
to height. In fact, the distributions as to weight exhibit 
marked positive skewness. For example, in the age group 
25 to 29 and height 5 feet 6 inches we find the following 
distribution: 

W    105   120    135    150   165   180   195   210 
F     17   722  2,175  1,346   485   155    33     3 

where W = weight in pounds, F = frequency. 


A similar feature had been observed by the writer in 
examining many frequency distributions of ears of corn 
with respect to length of ears and weight of ears. The 
distributions as to weight showed this tendency to posi- 
tive skewness, whereas the distributions as to lengths of 
ears were much more nearly symmetrical. It seems natu- 
ral to assume that the weights of bodies are closely corre- 
lated with volumes. We may next take account of the 
fact that volumes of similar solids vary as the cubes of 
like linear dimensions. 

Such concrete illustrations suggest the investigation 
of the equation of the frequency curve of values obtained 
by the transformation of variates of a normal distribu- 
tion by replacing each variate x of the normal distribution 
by an assigned function of the form $kx^w$, where k is a 
positive constant and w is a positive integer or the recipro- 
cal of a positive integer. A paper on this subject by the 
writer appeared in the Annals of Mathematics^ in June, 
1922. The skewness observed in the distributions of 
weights is similar to the skewness which results as the 
effect of this transformation when w is a positive integer. 
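The effect is easy to exhibit numerically: a symmetric set of "diameters" becomes positively skew when each value is cubed, as in the volume transformation $x \to kx^3$ (k = 1 in this toy example; the function name is ours):

```python
def mu3(data):
    """Third moment about the mean; its sign gives the direction of skewness."""
    n = len(data)
    m = sum(data) / n
    return sum((x - m) ** 3 for x in data) / n

diameters = [8, 9, 10, 11, 12]          # symmetric about 10: mu3 = 0
volumes = [d ** 3 for d in diameters]   # transform x -> k*x^w with w = 3, k = 1
```

Here `mu3(diameters)` is exactly zero while `mu3(volumes)` is positive: cubing stretches the upper half of the range more than the lower, producing the long right tail observed in the weight distributions.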

From a different standpoint S. D. Wicksell^ in the 
Arkiv för Matematik, Astronomi och Fysik in 1917 has 
discussed, by means of a generalized hypothesis about ele- 
mentary errors, a connection between certain functions 
of a variate and a genetic theory of frequency. The hy- 
potheses involved in this theory are at least plausible in 
their relation to certain statistical phenomena. There are 
thus at least two points of view which indicate that the 
method which uses variates resulting from transformation 
may rise above the position of a device for fitting distribu- 
tions and be given a place in the theory of frequency. A 


recent paper^ by E. L. Dodd presents a somewhat critical 
study of the determination of the frequency law of a func- 
tion of variables with given frequency laws, and another 
recent paper^ by S. Bernstein deals with appropriate 
transformations of variates of certain skew distributions. 
26. Remarks on the use of various frequency func- 
tions as generating functions in a series representation. 
In the Handbook of Mathematical Statistics (1924), page 
116, H. C. Carver called attention to certain generating 
functions designed to make frequency series more rapidly 
convergent than the Type A series. In a paper published 
in 1924 on the "Generalization of Some Types of the Fre- 
quency Curves of Professor Pearson" (Biometrika, pp. 
106-16), Romanovsky has used Pearson's frequency func- 
tions of Types I, II, and III as the generating functions 
of infinite series in which these types are involved in a 
manner analogous to the way in which the normal proba- 
bility function is involved in the Gram-Charlier series. 

When Type I, 

$$y = y_0\left(1 + \frac{x}{a_1}\right)^{m_1}\left(1 - \frac{x}{a_2}\right)^{m_2},$$

is used as a generating function, certain functions $\phi_k$, 
which are polynomials of Jacobi in slightly modified form, 
occur in the expansion in a way analogous to that in which 
the Hermite polynomials occur in the Gram-Charlier ex- 
pansion. Moreover, the analogy is continued because 
$w_0\phi_k$ and $\phi_k$ form a biorthogonal system, and this prop- 
erty facilitates the determination of the coefficients in 
the series. 

When the Type III function 

$$y = y_0\left(1 + \frac{x}{a}\right)^{\gamma a} e^{-\gamma x}$$

is used as a generating function, certain functions $\phi_k$, 
which are polynomials of Laguerre in generalized form, 
play a rôle similar to that of the polynomials of Hermite 
in the Gram-Charlier expansion. 

While it is at least of theoretical interest that various 
frequency functions may assume rôles in the series repre- 
sentation of frequency somewhat similar to the rôle of 
the normal frequency function in the Gram-Charlier 
theory, the fact should not be overlooked that the useful- 
ness of any series representation in applications to nu- 
merical data is much restricted by the requirement of 
such rapid convergence of the series that only a few terms 
need be taken to obtain a useful approximation. 


27. The meaning of simple correlation. Suppose we 
have data consisting of N pairs of corresponding variates 
$(x_i, y_i)$, $i = 1, 2, \ldots, N$. The given pairs of values may 
arise from any one of a great variety of situations. For 
example, we may have a group of men in which x repre- 
sents the height of a man and y his weight; we may 
have a group of fathers and their oldest sons in which 
x is the stature of a father and y that of his oldest son; 
we may have minimal daily temperatures in which x is 
the minimal daily temperature at New York and y 
the corresponding value for Chicago; we may be considering 
the effect of nitrogen on wheat yield where x is pounds 
of nitrogen applied per acre and y the wheat yield; we 
may be throwing two dice where x is the number thrown 
with the first die and y the number thrown with the two 
dice together. If such a set of pairs of variates is represented 
by dots marking the points whose rectangular co-ordinates 
are (x, y), we obtain a so-called "scatter-diagram." 

Fig. 17 

Assume next that we are interested in a quantitative 
characterization of the association of the x's and the cor- 
responding y's. One of the most important questions 
which can be considered in such a characterization is that 
of the connection, or correlation as it is called, between the 
two sets of values. It is fairly obvious from the scatter- 
diagram that, with values of x in an assigned interval dx 
(dx small), the corresponding values of y may differ con- 
siderably, and thus the y corresponding to an assigned x 
cannot be given by the use of a single-valued function of 
x. On the other hand, it may be easily shown that in 
certain cases, for an assigned x larger than the mean value 
of x's, a corresponding y taken at random is much more 
likely to be above than below the mean value of y's. In 
other words, the x's and y's are not independent in the 
probability sense of independence. There is often in such 
situations a tendency for the dots of the scatter-diagram 
to fall into a sort of band which can be fairly well de- 
scribed. In short, there exists an important field of statis- 
tical dependence and connection between the regions of 
perfect dependence given by a single-valued mathematical 
function at one extreme and perfect independence in the 
probability sense at the other extreme. This is the field 
of correlated variables, and the problems in this field are 
so varied in their character that the theory of correlation 
may properly be regarded as an extensive branch of mod- 
ern methodology. 

28. The regressive method and the correlation sur- 
face method of describing correlation. It may help to 
visualize the theory of correlation if we point out two 
fundamental ways of approach to the characterization of 
a distribution of correlated variables, although the two 
methods have much in common. The one may be called 


the "regression method," and the other the "correlation 
surface method." 

Let us assume that the pairs of variates (x, y) are 
represented by dots of a scatter-diagram, and set the 
problem of characterizing the correlation. First, separate 
the dots into classes by selecting class intervals dx. When 
we restrict the x's to values in such an interval dx, the set 
of corresponding y's is called an x-array of y's or simply 
an array of y's. Similarly, when we restrict the assign- 
ment of y's to a class interval dy, the corresponding set 
of x's is called a y-array of x's or simply an array of x's. 
The whole set of arrays of a variable, say of y, is often 
called a set of parallel arrays. 

The regression curve y = f(x) of y on x for a population 
is defined to be the locus of the expected value (§ 6) of the 
variable y in the array which corresponds to an assigned 
value of x, as dx approaches zero. In other words, the 
regression curve of y on x is the locus of the means of 
arrays of y's of the theoretical distribution, as dx ap- 
proaches zero. 

These equivalent definitions relate to the ideal popula- 
tion from which a sample is to be drawn. The regression 
curve found from a sample is merely a numerical approxi- 
mation to the ideal set up in the definition. 

In the regression method, our first interest is in the 
regression curves of y on x and of x on y. We are inter- 
ested next in the characterization of the distribution of 
the values of y (array of y's) whose expected or average 
value we have predicted. This is accomplished to some 
extent by means of measures of dispersion of the values 
of y which correspond to an assigned value of x. To illus- 
trate the regression method by reference to the correlation 
between statures of father and son, we may say that 
the first concern in the use of the regression method is 
with predicting the mean stature of a subgroup of men 
whose fathers are of any assigned height, and the next 
concern is with predicting the dispersion of such a sub- 
group. The complete characterization of the theoretical 
distributions underlying arrays of y's may be regarded as 
the complete solution of the problem of the statistical 
dependence of y on x. 

In the correlation surface method for the two vari- 
ables, our primary interest is in the characterization of the 
probability $\phi(x, y)\,dx\,dy$ that a pair of corresponding vari- 
ates (x, y) taken at random will fall into the assigned 
rectangular area bounded by x to x+dx and y to y+dy. 
This method may be regarded as an extension to func- 
tions of two or more variables of the method of theoretical 
frequency functions of one variable. To get at the mean- 
ing of correlation by this method, suppose that a func- 
tion g(x) is such that g(x)dx gives, to within infinitesimals 
of higher order, the probability that a variate x taken 
at random lies between x and x+dx; and suppose that 
h(x, y)dy gives similarly the probability that a variate 
y taken at random from the array of values which cor- 
respond to values of x in the interval x to x+dx will lie 
between y and y+dy. Then the probability that the two 
events will both happen is given by the product 

(1) $\phi(x, y)\,dx\,dy = g(x)\,h(x, y)\,dx\,dy\,.$

For the probability that both of two events will happen 
is the product of the probability that the first will happen, 
multiplied by the probability that the second will happen 
when the first is known to have happened. 


Two cases occur in considering this product. In the 
first case, h(x, y) is a function of y alone. When this is the 
case we say the x and y variates are uncorrelated, and 
$\phi(x, y)$ is simply the product of a function of x only multi- 
plied by a function of y only. In such a case the proba- 
bility that a variate y will be between y and y+dy is the 
same whether the corresponding assigned x be large or 
small. In the second case h(x, y) is a function of both x 
and y. In such cases, the probability that a variate y will 
be between y and y+dy is not, in general, the same for 
corresponding assigned large and small values of x. In 
such cases the two systems of variates are said to be 
correlated. Thus, in considering for example a group of 
college students, the height of a student is probably 
uncorrelated with the grades he makes in mathematics 
or with the income of his father, but his height is cor- 
related with his weight, and with the height of his father. 

Both the regression method and the correlation sur- 
face method of dealing with correlation have been in 
evidence almost from the earliest contributions to the sub- 
ject. The early method of Francis Galton was essentially 
the regression method, but the mathematical solution of 
the special problem^^ which he proposed to J. D. Hamilton 
Dickson in 1886 consisted in giving the equation of the 
normal frequency surface to correspond to given lines of 
regression. The solution of this problem thus involved 
the correlation surface method. Furthermore, the early 
contributions of Karl Pearson to correlation theory, in- 
volving the influence of selection, stressed frequency sur- 
faces^ more than regression equations. But beginning 
with a paper^ by G. Udny Yule in 1897, the theory has 
been developed without limitation to a particular type of 
frequency surface. It is a fact of some interest that Yule 
returned very closely to the primary ideas of Galton, by 
placing the emphasis on the lines of regression. Moreover, 
the success of the regression method of approach should 
give us an insight into the simplicity and fundamental 
character of Galton's original ideas. 

29. The correlation coefficient. The degree of correla- 
tion is often measured by the Pearsonian coefficient of 
correlation represented by the letter r. Consider N pairs 
of variates $(x_i, y_i)$, $i = 1, 2, \ldots, N$, such as are de- 
scribed above, and let $(\bar{x}, \bar{y})$ represent the corresponding 
arithmetic means of x's and y's. Then 

$$\sigma_x = \left[\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2\right]^{1/2}, \qquad \sigma_y = \left[\frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})^2\right]^{1/2}$$

are the standard deviations of the two series. 

Assuming that at least two of the x's are unequal so 
that $\sigma_x \neq 0$, we let any variate which is denoted by $x_i$ in 
original units (yards, miles, pounds, dollars) be denoted 
by $x_i'$ when measured from the mean $\bar{x}$ with the standard 
deviation $\sigma_x$ as a unit. Similarly, let the value of $y_i$ be 
denoted by $y_i'$ when measured from the mean $\bar{y}$ with $\sigma_y$ as 
a unit. That is, 

$$x_i' = (x_i - \bar{x})/\sigma_x\,, \qquad y_i' = (y_i - \bar{y})/\sigma_y\,.$$

Then in terms of $x_i'$ and $y_i'$, the correlation coefficient is 
given by the simple formula 

(2) $r = \dfrac{1}{N}\displaystyle\sum_{i=1}^{N} x_i'\,y_i'\,.$

That is, the correlation coefficient of two sets of vari- 
ates, expressed with their respective standard deviations 
as units, may be defined as the arithmetic mean of the 
products of deviations of corresponding values from their 
respective arithmetic means. 
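This definition translates directly into a few lines of code. A minimal sketch of formula (2) (the function name is ours):

```python
import math

def correlation(xs, ys):
    """Pearson r as the mean product of standardized deviates, formula (2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / n

r_direct = correlation([1, 2, 3], [2, 4, 6])    # approximately  1 (exact linear relation)
r_inverse = correlation([1, 2, 3], [6, 4, 2])   # approximately -1
```

An exact linear relation with positive slope gives r = 1, and with negative slope r = −1, anticipating the bounds of (7) below.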

We have defined the correlation coefficient r for a 
sample. The expected value of the right-hand member of 
(2) in the sampled population is then the correlation co- 
efficient for the population. 

While the formula (2) is very useful for the purpose of 
giving the meaning of the correlation coefficient, other 
formulas easily obtained from (2) are usually much better 
adapted to numerical computation. For example, 

(3) $r = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{N\sigma_x\sigma_y}\,,$

(4) $r = \dfrac{\sum_i x_i y_i - N\bar{x}\bar{y}}{N\sigma_x\sigma_y}\,,$

are ordinarily more convenient than (2) for purposes of 
computation. 

When N is small, say < 30, formula (4) is readily ap- 
plied. When N is large, appropriate forms for the calcula- 
tion of r are available in various books. 

Still other forms for expressing r are useful for certain 
purposes. For example, for the purpose of showing that 
$-1 \le r \le 1$, we shall now give two further formulas 
for r. 

By simple algebraic verification, and remembering that 
$1 = \sum x_i'^2/N = \sum y_i'^2/N$, it follows that (2) may be written 
in the forms 

(5) $r = 1 - \dfrac{1}{2N}\displaystyle\sum_i (x_i' - y_i')^2\,,$

(6) $r = -1 + \dfrac{1}{2N}\displaystyle\sum_i (x_i' + y_i')^2\,.$

From these two formulas, we have the important proposi- 
tion that 

(7) $-1 \le r \le 1\,.$


30. Linear regression. Suppose we are interested in 
the mean value $\bar{y}_x$ of the y's in the x-array of y's. The 
simplest and most important case to consider from the 
standpoint of the practical problems of statistics is that 
in which the regression of y on x is a straight line. Assum- 
ing that the regression curve of y on x in the population 
is a straight line, we accept as an approximation the line 
$\bar{y}_x = mx + b$ which fits "best" the means of arrays of the 
sample. 

The term "best" is here used to mean best under a 
least-squares criterion of approximation. In applying the 
criterion the square $(\bar{y}_x - mx - b)^2$ for each array is weight- 
ed with the number in the array. Let $N_x$ be the number 
of dots in any assigned x-array of y's. Then the equation 
of our line of regression would be 

(8) $\bar{y}_x = mx + b\,,$

where m and b are to be determined by the condition that 
the sum 

(9) $\displaystyle\sum_x N_x(\bar{y}_x - mx - b)^2\,,$


with observed data substituted for x, $\bar{y}_x$, and $N_x$ from all 
arrays, is to be a minimum. Differentiating (9) with re- 
spect to b and m, we have 

(10) $-2\displaystyle\sum_x N_x(\bar{y}_x - mx - b) = 0\,,$

(11) $-2\displaystyle\sum_x N_x(\bar{y}_x - mx - b)\,x = 0\,.$

We may note that $N_x\bar{y}_x$ is equal to the sum of all y's in 
an array of y's. If we examine these equations on making 
substitutions for $\bar{y}_x$ and x, it is easily seen that they are, 
except for grouping errors which vanish as $dx \to 0$, equiva- 
lent to the equations 

(12) $-2\displaystyle\sum_i (y_i - mx_i - b) = 0\,,$

(13) $-2\displaystyle\sum_i x_i(y_i - mx_i - b) = 0\,,$

where the summation is extended to all the given pairs. 
That is, we may find the regression line by obtaining the 
linear function y = mx + b which gives the best least- 
square estimate of the values of y which correspond to 
assigned values of x. Take the origin at the mean of x's 
and the mean of y's. Then $\sum y_i = 0$, $\sum x_i = 0$. Hence, 
from (12), b = 0. From (13), 

$$m = \frac{\sum x_i y_i}{\sum x_i^2} = r\,\frac{\sigma_y}{\sigma_x}\,,$$

and the equation of the line of regression of y on x is 

(14) $y = r\,\dfrac{\sigma_y}{\sigma_x}\,x\,.$


Similarly, the line of regression of x on y is 

(15) $x = r\,\dfrac{\sigma_x}{\sigma_y}\,y\,.$

It should be remembered that the origin is at the mean 
values of x's and of y's when the regression equations take 
the forms (14) and (15). It is obvious that these equa- 
tions may be written as 

(16) $y - \bar{y} = r\,\dfrac{\sigma_y}{\sigma_x}\,(x - \bar{x})\,,$

(17) $x - \bar{x} = r\,\dfrac{\sigma_x}{\sigma_y}\,(y - \bar{y})\,,$

when we take any arbitrary origin. 

The coefficient $r\sigma_y/\sigma_x$ is called the regression coeffi- 
cient of y on x, and similarly $r\sigma_x/\sigma_y$ is the regression co- 
efficient of x on y. 
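The two regression coefficients can be computed together from the product and squared sums of deviations. A minimal sketch (the function name is ours):

```python
import math

def regression_coefficients(xs, ys):
    """Slopes of the two regression lines (14) and (15):
    y on x is r*sy/sx;  x on y is r*sx/sy."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    sx, sy = math.sqrt(sxx / n), math.sqrt(syy / n)
    return r * sy / sx, r * sx / sy

m_yx, m_xy = regression_coefficients([1, 2, 3, 4], [2, 3, 5, 6])
# the product of the two slopes is r**2, so it never exceeds 1
```

Note that the two slopes are not reciprocals of each other unless r = ±1; their product is r², a fact often used as a check on the computation.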

If we use standard deviations as units of measurement, 
the regression equations (14) and (15) become 

(18) $y' = rx'\,, \qquad x' = ry'\,,$

and the regression coefficients are equal to each other and 
to the correlation coefficient. 

When there is no correlation between x's and y's, r = 0, 
and the regression lines of y on x and of x on y are parallel 
to the x- and y-axes, respectively. On the other hand, 
when r = 0, it is not necessarily true that there is no cor- 
relation. Indeed, there may be a high correlation^ with 
non-linear regression when r = 0. For example, we may 
have r = 0 when y is a simple periodic function of x. 


31. The standard deviation of arrays — mean square 
error of estimate. In passing judgment on the degree of 
precision to be expected in estimating the value of a vari- 
able, say y, by means of the regression equation of y on x, 
it is important to have a measure of the dispersion in 
arrays of y's. 

The mean square error $s_y^2$ involved in taking the ordi- 
nates of the line of regression as the estimated values of 
y may be very simply expressed by $s_y^2 = \sigma_y^2(1 - r^2)$. To 
prove that $s_y^2$ takes this value, we may write the sum of 
the squares of deviations from the regression line, with 
origin at the means, in the form 

(19) $Ns_y^2 = \displaystyle\sum_i\left(y_i - r\frac{\sigma_y}{\sigma_x}x_i\right)^2 = \sum_i y_i^2 - 2r\frac{\sigma_y}{\sigma_x}\sum_i x_iy_i + r^2\frac{\sigma_y^2}{\sigma_x^2}\sum_i x_i^2\,.$

Since $\sum y_i^2 = N\sigma_y^2$, $\sum x_i^2 = N\sigma_x^2$, and $\sum x_iy_i = Nr\sigma_x\sigma_y$, we have 

(20) $s_y = \sigma_y(1 - r^2)^{1/2}\,.$
This value of $s_y$ may be regarded as a sort of average 
value of the standard deviations of the arrays of y's, and 
is sometimes called the root-mean-square error of estimate 
of y, or more briefly, the standard error of estimate of y. 
The factor $(1 - r^2)^{1/2}$ in (20) has been called the coefficient 
of alienation, or the measure of the failure to improve the 
estimate of y from knowledge of the correlation. 

When the standard deviation of an array of y's is re- 
garded as a function, say S(x), of the assigned x, the curve 
$y = S(x)/\sigma_y$ is called the scedastic curve. It may be de- 
scribed as the curve whose ordinates measure the scatter 
in arrays of y's in comparison to the scatter of all y's. 
When S(x) is a constant, the regression system of y on x 
is called a homoscedastic system. When S(x) is not con- 
stant, the system is said to be heteroscedastic. For a homo- 
scedastic system with linear regression, $s_y = \sigma_y(1-r^2)^{1/2}$ is 
the standard deviation of each array of y's. 

To illustrate (20) numerically, let us suppose that 
r = .5 gives the correlation of statures of fathers and sons. 
Assuming linear regression, the root-mean-square error 
of estimate of the height of a son derived from the as- 
signed height of the father would be 

$$s_y = \sigma_y(1 - .25)^{1/2} = .866\,\sigma_y\,.$$

That is, the average dispersion in the arrays of heights of 
sons which correspond to assigned heights of fathers is 
about .87 as great as the dispersion of the heights of all 
the sons. It is, therefore, fairly obvious that we cannot, 
with any considerable degree of reliability, predict from 
r = .5 the height of an individual son from the height of 
the father. However, with a large N, we can give a very 
reliable prediction of the mean heights of sons that corre- 
spond to assigned heights of fathers. 
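Formula (20) is a one-line computation; the sketch below reproduces the father-son illustration with r = .5 (the function name is ours):

```python
import math

def std_error_of_estimate(sigma_y, r):
    """Root-mean-square error of estimating y from x by the
    regression line, formula (20): s_y = sigma_y * sqrt(1 - r**2)."""
    return sigma_y * math.sqrt(1 - r ** 2)

# The father-son illustration with r = .5 and sigma_y = 1:
print(round(std_error_of_estimate(1.0, 0.5), 3))   # 0.866
```

Even a seemingly substantial correlation of .5 thus removes only about 13 per cent of the dispersion in an individual prediction.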

It should be remembered that we have thus far as- 
sumed linear regression of y on x. An analogous consider- 
ation of the dispersion in arrays of x's gives for the mean 
square error of estimate 

$$s_x^2 = \sigma_x^2(1 - r^2)\,,$$

when we assume linear regression of x on y. 

32. Non-linear regression — the correlation ratio. In 
case a curve of regression, say of y on x, is not a straight 
line, the correlation coefficient as a measure of correlation 
may be misleading. In introducing a correlation ratio, 
$\eta_{yx}$, of y on x as an appropriate measure of correlation 
to take the place of the correlation coefficient in such a 
situation, we may get suggestions as to what is appropri- 
ate by solving for $r^2$ in (19). This gives 

(21) $r^2 = 1 - s_y^2/\sigma_y^2\,,$

where we may recall that $s_y^2$ is the mean square of devia- 
tions from the line of regression. Then 

$$r = \left[1 - s_y^2/\sigma_y^2\right]^{1/2}\,.$$

This formula could be used appropriately as a defini- 
tion of r in place of our definition in (2), and its examina- 
tion may throw further light on the significance of r. 
When $s_y = 0$, the formula gives r = 1, and, as we have seen 
earlier, all the dots of the scatter-diagram must then fall 
exactly on the line of regression $y = r\sigma_y x/\sigma_x$. When $s_y = \sigma_y$, 
the formula gives r = 0, and the regression line is in this 
case of no aid in predicting the value of y from as- 
signed values of x. In the formula $r^2 = 1 - s_y^2/\sigma_y^2$ it is im- 
portant to keep in mind that the mean square deviation 
$s_y^2$ is from the line of regression (§ 31). Next, let $\bar{s}_y^2$ be 
the corresponding mean square of deviations from the 
means of arrays. Then in the population $\bar{s}_y^2 = s_y^2$ when 
the regression is strictly linear, but $\bar{s}_y^2 \neq s_y^2$ when the 
regression is non-linear. This fact suggests the use of a 
formula closely related to $[1 - s_y^2/\sigma_y^2]^{1/2}$ for a measure of 
non-linear regression by replacing $s_y$ by $\bar{s}_y$. We then write 

(22) $\eta_{yx}^2 = 1 - \bar{s}_y^2/\sigma_y^2\,,$


where $\eta_{yx}$ is the correlation ratio of y on x, and $\bar{s}_y^2$ is the 
mean square of deviations from the means of arrays, 
whether these means are near to or far from the line 
of regression. For linear regression of y on x, we have 
$\eta_{yx}^2 = r^2$ in the population. 

In general, we may say that the correlation ratio of 
y on x is a measure of the clustering of dots of the scatter- 
diagram about the means of arrays of y's. 

An analogous discussion for the arrays of x's obviously 
leads to 

(23) $\eta_{xy}^2 = 1 - \bar{s}_x^2/\sigma_x^2\,,$

giving $\eta_{xy}$, the correlation ratio of x on y. 

That $\eta_{yx}^2 \le 1$, and that the equality holds only when 
all the dots in each array are at the mean of the array, 
follows at once from (22). 

That $\eta_{yx}^2 \ge r^2$ may be shown by recalling the meanings of 
$s_y^2$ in (21) and of $\bar{s}_y^2$ in (22). A mean square of deviations 
in each array is a minimum when the deviations are taken 
from the mean of the array. Hence, the $\bar{s}_y^2$ in (22) must 
be equal to or less than $s_y^2$ in (21) for the same data, since 
the deviations in (21) are measured from the line of re- 
gression. Hence, we have shown that 

$$r^2 \le \eta_{yx}^2 \le 1\,.$$

Moreover, when the regression of y on x is linear, $\eta_{yx}^2 - r^2$ 
found from the sample differs from zero by an amount 
not greater than the fluctuations due to random sampling. 
Indeed, the comparison of the quantity $\eta_{yx}^2 - r^2$ with its 
sampling errors becomes the most useful known criterion 
for testing the linearity of the regression of y on x. 


For some purposes, it is convenient to express the 
correlation ratios in a form involving the standard devia- 
tion of the means of arrays. For this purpose, let J, be 
the mean of any array of ^''s, and (Sy^ the standard devia- 
tion of the means of arrays when the square {yx — yf of 
each deviation is weighted with the number iV, in the 
array. Then it follov/s very simply that 

„" _ ^'2 g\ 

O y Oy 

That is, the correlation ratio oi y on x is the ratio of the 
standard deviation of the means of arrays of 3''s to the 
standard deviation of all )/'s. 

The calculation of the correlation ratio with a large 
number N of pairs may be carried out very conveniently 
as a mere extension of the calculation of the correlation 
coefficient. For a form for such calculation, see Handbook
of Mathematical Statistics, page 130. 
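In modern terms, the quantities in (22) are easily computed from a grouped sample. The following is a minimal sketch (Python with numpy assumed; the grouped data are invented) that obtains the correlation ratio of $y$ on $x$ as one minus the ratio of the mean square within arrays to the total variance:

```python
import numpy as np

def correlation_ratio_sq(x, y):
    """eta^2 of y on x: 1 - (mean square of deviations from array means)
    divided by (variance of all y's), as in formula (22)."""
    x, y = np.asarray(x), np.asarray(y)
    total_var = y.var()                    # sigma_y^2
    ss_within = 0.0
    for level in np.unique(x):
        arr = y[x == level]                # one array of y's
        ss_within += ((arr - arr.mean()) ** 2).sum()
    mean_sq_within = ss_within / len(y)    # s-bar_y^2
    return 1.0 - mean_sq_within / total_var

# Invented grouped sample: three arrays of y's, one for each value of x.
x = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y = np.array([1.0, 1.2, 0.8, 3.0, 3.1, 2.9, 5.0, 4.8, 5.2])
eta2 = correlation_ratio_sq(x, y)
```

Here the array means 1, 3, 5 differ greatly while the scatter within each array is small, so the correlation ratio is close to 1; the same value is obtained as the weighted variance of the array means divided by the total variance, in agreement with (23).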

In order to get a fair approximation to a correlation 
ratio in a population from a sample, it is important that 
the grouping into class intervals be not so narrow as to 
give arrays containing very few variates. Certain valu- 
able formulas for the correction of errors due to grouping 
have been published. ^^ 

When the regression is non-linear, the correlation may 
be further characterized by the equation of a curve of 
regression that passes approximately through the means 
of arrays of a given system of variates. As early as 1905, 
the parameters of the special regression curves given by 
polynomials $y = f(x)$ of the second and third degrees were
determined in terms of power moments and product 


moments. In 1921, Karl Pearson^ published a general 
method of determining successive terms of the regression 
curve of the form 

(24)   $y = f(x) = a_0\psi_0 + a_1\psi_1 + \cdots + a_n\psi_n$,

where $a_0, a_1, \ldots, a_n$ are constants to be determined
and the functions $\psi$ form an orthogonal system of func-
tions of $x$. That is,

$\sum N_x \psi_p \psi_q = 0, \qquad p \neq q$,

if the summation $\sum$ be taken for all values of $x$ correspond-
ing to a system of arrays with frequency in an $x$-array
given by $N_x$. An exposition of the theory of non-linear
regression curves is somewhat beyond the scope of this
monograph.
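The determination of such a polynomial regression curve from power moments and product moments can be sketched numerically (Python with numpy assumed; the data are invented, and plain least squares is used rather than Pearson's orthogonal-function method):

```python
import numpy as np

# Invented data lying roughly on y = 1 + x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.0, 5.2, 9.8, 17.1, 26.0])

# Normal equations for y = a0 + a1*x + a2*x^2: the matrix involves only
# the power moments sum(x^(i+j)), and the right-hand side the product
# moments sum(y * x^i), for i, j = 0, 1, 2.
A = np.array([[np.sum(x ** (i + j)) for j in range(3)] for i in range(3)])
b = np.array([np.sum(y * x ** i) for i in range(3)])
a0, a1, a2 = np.linalg.solve(A, b)
```

np.polyfit(x, y, 2) solves the same least-squares problem and returns the same coefficients (highest power first).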

33. Multiple correlation. Thus far we have considered 
only simple correlation, that is, correlation between two 
variables. But situations frequently arise which call for 
the investigation of correlation among three or more vari- 
ables. A familiar example occurs in the correlation of a 
character such as stature in man with statures of each 
of the two parents, of each of the four grandparents, and 
possibly with statures of others back in the ancestral line. 
Other examples can be readily cited. Indeed, it is very 
generally true that several variables enter into many prob- 
lems of biology, economics, psychology, and education. 

The solution of these problems calls for a development 
of correlation among three or more variables. Suppose we 
have given $N$ sets of corresponding values of $n$ variables
$x_1, x_2, \ldots, x_n$. Assume next that we separate the val-
ues of $x_1$ into classes by selecting class intervals $dx_2, dx_3,
\ldots, dx_n$ of the remaining variables. When we limit the
$x_2$'s to an assigned interval $dx_2$, the $x_3$'s to an assigned interval
$dx_3$, and so on, the set of corresponding $x_1$'s is sometimes
called an array of $x_1$'s.

The locus of the means of such arrays of $x_1$'s in the
theoretical distribution, as $dx_2, dx_3, \ldots, dx_n$ approach
zero, is called the regression surface of $x_1$ on the remaining
variables. It will be convenient to assume that any vari-
able, $x_j$, is measured from the arithmetic mean of its $N$
given values as an origin. Let $\sigma_j$ be the standard devia-
tion of the $N$ values of $x_j$, and let $r_{pq}$ be the correlation
coefficient of the $N$ given pairs of values of $x_p$ and $x_q$.
Then we seek to determine $b_{12}, b_{13}, \ldots, b_{1n}, c$, the para-
meters in the linear regression surface,

(25)   $x_1 = b_{12}x_2 + b_{13}x_3 + \cdots + b_{1n}x_n + c$,

of $x_1$ on the remaining variables so that $x_1$ computed from
(25) will give on the whole the "best" estimates of the
values of $x_1$ that correspond to any assigned values of
$x_2, x_3, \ldots, x_n$. Adopting a least-squares criterion, we
may determine the coefficients in (25) so that

(26)   $U = \sum (x_1 - b_{12}x_2 - b_{13}x_3 - \cdots - b_{1n}x_n - c)^2$

shall be a minimum. This gives for the linear regression
surface of $x_1$ on $x_2, x_3, \ldots, x_n$,

(27)   $x_1 = -\sigma_1 \sum_{q=2}^{n} \frac{R_{1q}}{R_{11}}\,\frac{x_q}{\sigma_q}$,

where $R_{pq}$ is the cofactor of the element in the $p$th row and the $q$th
column of the determinant


$R = \begin{vmatrix}
1 & r_{12} & r_{13} & \cdots & r_{1n} \\
r_{21} & 1 & r_{23} & \cdots & r_{2n} \\
r_{31} & r_{32} & 1 & \cdots & r_{3n} \\
\cdot & \cdot & \cdot & \cdots & \cdot \\
r_{n1} & r_{n2} & r_{n3} & \cdots & 1
\end{vmatrix}.$

For simplicity we shall limit ourselves to n = 3 in giv- 
ing proofs of these statements, but the method can be 
extended in a fairly obvious manner from three variables 
to any number of variables. 
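Before the proof, the coefficients asserted in (27) may be checked numerically for $n = 3$: the cofactor expressions should agree with a direct solution of the normal equations (a sketch assuming numpy; the correlations and standard deviations are invented values):

```python
import numpy as np

# Invented simple correlations and standard deviations for three variables.
r12, r13, r23 = 0.6, 0.4, 0.5
s1, s2, s3 = 2.0, 1.5, 3.0

Rmat = np.array([[1.0, r12, r13],
                 [r12, 1.0, r23],
                 [r13, r23, 1.0]])

def cofactor(M, p, q):
    # Signed minor: delete row p and column q (0-indexed).
    minor = np.delete(np.delete(M, p, axis=0), q, axis=1)
    return (-1.0) ** (p + q) * np.linalg.det(minor)

R11, R12, R13 = (cofactor(Rmat, 0, q) for q in range(3))

# Regression-plane coefficients b12 = -(s1/s2)(R12/R11), etc.
b12 = -(s1 / s2) * (R12 / R11)
b13 = -(s1 / s3) * (R13 / R11)

# Direct solution of the normal equations
#   b12*s2^2      + b13*r23*s2*s3 = r12*s1*s2
#   b12*r23*s2*s3 + b13*s3^2      = r13*s1*s3
A = np.array([[s2 ** 2, r23 * s2 * s3],
              [r23 * s2 * s3, s3 ** 2]])
rhs = np.array([r12 * s1 * s2, r13 * s1 * s3])
b12_direct, b13_direct = np.linalg.solve(A, rhs)
```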

Equating to zero the first derivatives of $U$ in (26)
with respect to $c$, $b_{12}$, and $b_{13}$, we obtain, when $n = 3$, the
equations

$c = 0$,

$-2\sum x_2(x_1 - b_{12}x_2 - b_{13}x_3) = 0$,

$-2\sum x_3(x_1 - b_{12}x_2 - b_{13}x_3) = 0$.

The last two equations may be written in the form

$\sum x_1x_2 - b_{12}\sum x_2^2 - b_{13}\sum x_2x_3 = 0$,

$\sum x_1x_3 - b_{12}\sum x_2x_3 - b_{13}\sum x_3^2 = 0$.

By expressing the summations in terms of standard de-
viations and correlation coefficients, we have

$Nb_{12}\sigma_2^2 + Nb_{13}r_{23}\sigma_2\sigma_3 = Nr_{12}\sigma_1\sigma_2$,

$Nb_{12}r_{23}\sigma_2\sigma_3 + Nb_{13}\sigma_3^2 = Nr_{13}\sigma_1\sigma_3$.

Solving for $b_{12}$ and $b_{13}$, we obtain

$b_{12} = \frac{\sigma_1}{\sigma_2}\,
\frac{\begin{vmatrix} r_{12} & r_{23} \\ r_{13} & 1 \end{vmatrix}}
     {\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}
= -\frac{\sigma_1}{\sigma_2}\,\frac{R_{12}}{R_{11}}\,, \qquad
b_{13} = \frac{\sigma_1}{\sigma_3}\,
\frac{\begin{vmatrix} r_{13} & r_{23} \\ r_{12} & 1 \end{vmatrix}}
     {\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}
= -\frac{\sigma_1}{\sigma_3}\,\frac{R_{13}}{R_{11}}\,,$

and, in general,

$b_{1q} = -\frac{\sigma_1}{\sigma_q}\,\frac{R_{1q}}{R_{11}}\,,$

where $R_{pq}$ is the cofactor of the $p$th row and $q$th column of

$R = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix}.$

If the dispersion (scatter) $\sigma_{1.23\ldots n}$ of the observed
values of $x_1$ from its corresponding computed values on
the hyperplane (27) is defined as the square root of the
mean square of the deviations, that is,

(31)   $\sigma_{1.23\ldots n}^2 = \frac{1}{N}\sum(\text{observed } x_1 - \text{computed } x_1)^2$,

then it can be proved that

(32)   $\sigma_{1.23\ldots n} = \sigma_1(R/R_{11})^{1/2}$.


To prove this for $n = 3$, we may write from (27) and (31)

$\sigma_{1.23}^2 = \frac{1}{N}\sum\Big(x_1 + \frac{R_{12}}{R_{11}}\frac{\sigma_1}{\sigma_2}x_2 + \frac{R_{13}}{R_{11}}\frac{\sigma_1}{\sigma_3}x_3\Big)^2$

$= \frac{\sigma_1^2}{R_{11}^2}\,(R_{11}^2 + R_{12}^2 + R_{13}^2 + 2R_{11}R_{12}r_{12} + 2R_{11}R_{13}r_{13} + 2R_{12}R_{13}r_{23})$

$= \frac{\sigma_1^2}{R_{11}^2}\,[R_{11}(R_{11} + r_{12}R_{12} + r_{13}R_{13}) + R_{12}(R_{12} + r_{12}R_{11} + r_{23}R_{13}) + R_{13}(R_{13} + r_{13}R_{11} + r_{23}R_{12})]$.

Since, from elementary theorems of determinants,

$R_{11} + r_{12}R_{12} + r_{13}R_{13} = R$,
$R_{12} + r_{12}R_{11} + r_{23}R_{13} = 0$,
$R_{13} + r_{13}R_{11} + r_{23}R_{12} = 0$,

we have

(33)   $\sigma_{1.23}^2 = \sigma_1^2 R/R_{11}$, $\qquad \sigma_{1.23} = \sigma_1(R/R_{11})^{1/2}$.

As an extension of the standard error of estimate with
two variables (p. 87), it is true for $n$ variables that the
standard error $\sigma_{1.23\ldots n}$ of estimating $x_1$ from assigned
values of $x_2, x_3, \ldots, x_n$ is the standard deviation of
each array of $x_1$'s, provided all regressions are linear and
the standard deviation of an array of $x_1$'s is the same for
all sets of assignments of $x_2, x_3, \ldots, x_n$.

Next, we shall inquire into the dispersion of the esti-
mated values given by (27). Since the mean value of these
estimates is zero when the origin is at the mean of each
system of variates, we have the standard deviation $\sigma_{1E}$ of
the estimates of $x_1$ given, for $n = 3$, by

$\sigma_{1E}^2 = \frac{1}{N}\sum\Big(\frac{R_{12}}{R_{11}}\frac{\sigma_1}{\sigma_2}x_2 + \frac{R_{13}}{R_{11}}\frac{\sigma_1}{\sigma_3}x_3\Big)^2 = \frac{\sigma_1^2}{R_{11}^2}\{R_{12}^2 + R_{13}^2 + 2r_{23}R_{12}R_{13}\} = \sigma_1^2\Big(1 - \frac{R}{R_{11}}\Big).$

The correlation coefficient $r_{1.23\ldots n}$ between the ob-
served values of $x_1$ and its corresponding estimated values
calculated from the linear function (27) of $x_2, x_3, \ldots, x_n$
is called the multiple correlation coefficient of order $n-1$
of $x_1$ with the other $n-1$ variables. The multiple correla-
tion coefficient $r_{1.23\ldots n}$ is expressible in terms of simple
correlation coefficients by the formula

(34)   $r_{1.23\ldots n} = [1 - R/R_{11}]^{1/2}$.

To prove (34), limiting ourselves to $n = 3$, we write

$N\sigma_1\sigma_{1E}\,r_{1.23} = \sum x_1\Big[-\sigma_1\Big(\frac{R_{12}}{R_{11}}\frac{x_2}{\sigma_2} + \frac{R_{13}}{R_{11}}\frac{x_3}{\sigma_3}\Big)\Big]$

$= -\frac{N\sigma_1^2}{R_{11}}(R_{12}r_{12} + R_{13}r_{13}) = \frac{N\sigma_1^2}{R_{11}}(R_{11} - R)$,

the last step following from $R_{11} + r_{12}R_{12} + r_{13}R_{13} = R$. Then, since

$\sigma_{1E} = \sigma_1[1 - R/R_{11}]^{1/2}$,

we have the result sought,

$r_{1.23} = [1 - R/R_{11}]^{1/2}$.

The relation (34) is very significant because it enables us 
to express multiple correlation coefficients in terms of sim- 
ple correlation coefficients. 
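Formula (34) may be checked on simulated data: the correlation between observed $x_1$ and its least-squares estimate should reproduce $[1 - R/R_{11}]^{1/2}$ computed from the determinant of sample correlations (a sketch assuming numpy; the generating coefficients are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Simulate three correlated variables (invented coefficients).
x2 = rng.standard_normal(n)
x3 = 0.5 * x2 + rng.standard_normal(n)
x1 = 0.8 * x2 - 0.3 * x3 + rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
X = X - X.mean(axis=0)                 # measure each variable from its mean

Rmat = np.corrcoef(X, rowvar=False)    # matrix of simple correlations
R = np.linalg.det(Rmat)
R11 = np.linalg.det(Rmat[1:, 1:])      # cofactor of the leading element
r_multiple = np.sqrt(1.0 - R / R11)    # formula (34)

# Correlation of x1 with its estimate from the regression plane.
coef, *_ = np.linalg.lstsq(X[:, 1:], X[:, 0], rcond=None)
estimate = X[:, 1:] @ coef
r_direct = np.corrcoef(X[:, 0], estimate)[0, 1]
```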

From equations (32) and (34), it follows that

(35)   $\sigma_{1.23\ldots n}^2 = \sigma_1^2(1 - r_{1.23\ldots n}^2)$.

34. Partial correlation. It is often important to obtain
the degree of correlation between two variables $x_1$ and $x_2$
when the other variables $x_3, x_4, \ldots, x_n$ have assigned
values. For example, we might find the correlation of
statures of fathers and sons when the stature of the moth-
er is an assigned constant, say 62 inches. In general, sup-
pose we have found a correlation between characters A
and B, and that it is a plausible interpretation that the
correlation thus found is due to the correlation of each
of them with a character C. In this case we could remove
the influence of C, if we had a sufficient amount of data,
by restricting our data to a universe of A and B corre-
sponding to an assigned C.

In accord with this notion, we may define a partial
correlation coefficient $r'_{12.34\ldots n}$ of $x_1$ and $x_2$ for assigned
$x_3, x_4, \ldots, x_n$, as the correlation coefficient of $x_1$ and
$x_2$ in the part of the population for which $x_3, x_4, \ldots,
x_n$ have assigned values. A change in the selection of as-
signed values may lead to the same or to different values
of $r'_{12.34\ldots n}$.

Suppose we are dealing with a population for which
the regression curves are straight lines and the regression
surfaces are planes. Thus, let us assume that the theo-
retical mean or expected values of $x_1$ and $x_2$ for an as-
signed $x_3, x_4, \ldots, x_n$ are

$b_{13}x_3 + b_{14}x_4 + \cdots + b_{1n}x_n$,
$b_{23}x_3 + b_{24}x_4 + \cdots + b_{2n}x_n$,

respectively. Then a partial correlation coefficient
$r'_{12.34\ldots n}$ is the simple correlation coefficient of the residuals

$x_{1.34\ldots n} = x_1 - b_{13}x_3 - b_{14}x_4 - \cdots - b_{1n}x_n$,
$x_{2.34\ldots n} = x_2 - b_{23}x_3 - b_{24}x_4 - \cdots - b_{2n}x_n$,

limited to the part of the population $N_{34\ldots n}$ of the
total $N$ for which $x_3, x_4, \ldots, x_n$ are fixed.

Suppose further that the population is such that any
change in the assignment of values to $x_3, x_4, \ldots, x_n$
does not change the standard deviation of $x_{1.34\ldots n}$,
nor of $x_{2.34\ldots n}$, nor the value of $r'_{12.34\ldots n}$. Such a
population suggests that we define

(36)   $r_{12.34\ldots n} = \dfrac{\sum x_{1.34\ldots n}\; x_{2.34\ldots n}}{N\,\sigma_{1.34\ldots n}\,\sigma_{2.34\ldots n}}$,

where the summation is extended to the $N$ pairs of residuals,
as the partial correlation coefficient of $x_1$ and $x_2$ for all sets
of assignments of $x_3, x_4, \ldots, x_n$.

If the population is such that $r'_{12.34\ldots n}$ is not the
same for each different set of assignments of $x_3, x_4, \ldots, x_n$,
the right-hand member of (36) may still be regarded as
a sort of average value of the correlation coefficients
of $x_1$ and $x_2$ in subdivisions of a population obtained by
assigning $x_3, x_4, \ldots, x_n$, or it may be regarded as the
correlation coefficient between the deviations of $x_1$ and
$x_2$ from the corresponding predicted values given by their
linear regression equations on $x_3, x_4, \ldots, x_n$.

The partial correlation coefficient as given in (36) is
expressible in terms of simple correlation coefficients by
the formula

(37)   $r_{12.34\ldots n} = \dfrac{-R_{12}}{[R_{11}R_{22}]^{1/2}}$,

where $R_{pq}$ is a cofactor defined in §33.

We may prove (37), limiting ourselves to $n = 3$, as follows:
By definition,

$r_{12.3} = \dfrac{\sum x_{1.3}\,x_{2.3}}{N\sigma_{1.3}\,\sigma_{2.3}}
= \dfrac{\sum\Big(x_1 - r_{13}\dfrac{\sigma_1}{\sigma_3}x_3\Big)\Big(x_2 - r_{23}\dfrac{\sigma_2}{\sigma_3}x_3\Big)}
        {\Big[\sum\Big(x_1 - r_{13}\dfrac{\sigma_1}{\sigma_3}x_3\Big)^2 \sum\Big(x_2 - r_{23}\dfrac{\sigma_2}{\sigma_3}x_3\Big)^2\Big]^{1/2}}$

$= \dfrac{r_{12} - r_{13}r_{23}}{[(1 - r_{13}^2)(1 - r_{23}^2)]^{1/2}} = \dfrac{-R_{12}}{[R_{11}R_{22}]^{1/2}}$.

Thus, (37) is proved for $n = 3$.
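The proof can be accompanied by a numerical check: on simulated data in which $x_1$ and $x_2$ are correlated only through $x_3$, formula (37) should agree with the correlation of the residuals of §34 and should be near zero (a sketch assuming numpy; the generating coefficients are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50000

# x1 and x2 are correlated only through the common variable x3.
x3 = rng.standard_normal(n)
x1 = 0.7 * x3 + rng.standard_normal(n)
x2 = 0.6 * x3 + rng.standard_normal(n)

r12 = np.corrcoef(x1, x2)[0, 1]
r13 = np.corrcoef(x1, x3)[0, 1]
r23 = np.corrcoef(x2, x3)[0, 1]

# Formula (37) for n = 3.
r12_3 = (r12 - r13 * r23) / np.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# The same coefficient as the simple correlation of the residuals
# x1 - b13*x3 and x2 - b23*x3, with b13 = r13*s1/s3 and b23 = r23*s2/s3.
b13 = r13 * x1.std() / x3.std()
b23 = r23 * x2.std() / x3.std()
r_resid = np.corrcoef(x1 - b13 * x3, x2 - b23 * x3)[0, 1]
```

The raw correlation $r_{12}$ is substantial, while the partial correlation is close to zero, since the dependence runs entirely through $x_3$.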

An important relation between partial and multiple
correlation coefficients may now be derived. From (37)
we have

$1 - r_{12.34\ldots n}^2 = \dfrac{R_{11}R_{22} - R_{12}^2}{R_{11}R_{22}}$.

By a well-known theorem of determinants,^

$\begin{vmatrix} R_{11} & R_{12} \\ R_{12} & R_{22} \end{vmatrix} = R_{11}R_{22} - R_{12}^2 = R\,R_{11,22}$,

where $R_{11,22}$ is the cofactor obtained from $R$ by deleting
its first two rows and first two columns. Hence we have

$1 - r_{12.34\ldots n}^2 = \dfrac{R\,R_{11,22}}{R_{11}R_{22}} = \dfrac{R/R_{11}}{R_{22}/R_{11,22}} = \dfrac{1 - r_{1.23\ldots n}^2}{1 - r_{1.34\ldots n}^2}$,

since from (32) and (35),

$\dfrac{R}{R_{11}} = 1 - r_{1.23\ldots n}^2$,

and similarly

$\dfrac{R_{22}}{R_{11,22}} = 1 - r_{1.34\ldots n}^2$.



Thus we can express the partial correlation coefficient
$r_{12.34\ldots n}$ of order $n-2$ (the number of variables held
constant) in terms of the multiple correlation coefficient
$r_{1.23\ldots n}$ of order $n-1$ and the multiple correlation co-
efficient $r_{1.34\ldots n}$ of order $n-2$.

35. Non-linear regression in $n$ variables—multiple
correlation ratio. The theory of correlation for non-linear
regression lends itself to extension to the case of more
than two variables, as has been demonstrated by the con-
tributions of L. Isserlis^^ and Karl Pearson.*^

Consider the variables $x_1, x_2, \ldots, x_n$, and fix atten-
tion on an array of $x_1$'s which corresponds to assigned
values of $x_2, x_3, \ldots, x_n$. Next, let $\bar{x}_{1.23\ldots n}$ be the
mean of the values in the array of $x_1$'s and let $\sigma'_{1.23\ldots n}$
be the standard deviation of these means of arrays of $x_1$'s,
where the square of each deviation $\bar{x}_{1.23\ldots n} - \bar{x}_1$ from the
mean of the $x_1$'s is weighted with the number in the array in
finding this standard deviation. Then the multiple cor-
relation ratio $\eta_{1.23\ldots n}$ of $x_1$ on $x_2, x_3, \ldots, x_n$ may be
defined by

(38)   $\eta_{1.23\ldots n}^2 = \dfrac{\sigma_{1.23\ldots n}'^2}{\sigma_1^2}$.

The analogy with the case of the correlation ratio for
two variables seems fairly obvious. While the method of
computing the multiple correlation ratio $\eta_{1.23\ldots n}$ is
simple in principle, it is unfortunately laborious from the
arithmetic standpoint.

36. Remarks on the place of probability in the regres- 
sion method. Thus far we have discussed simple correla- 
tion by the regression method without using probabilities 
in explicit form. To be sure, probability theory is in- 
volved in the background. It seems fairly obvious that 
it would be of fundamental interest to construct urn 
schemata which would give a meaning to the correlation 
and regression coefficients in pure chance. In a paper** 
published by the author in 1920, certain urn schemata 
were devised which give linear regression and very simple 
values for the correlation coefficient. Other schemata ap- 
parently equally simple give non-linear regression. The 
general plan of the schemata consists in requiring certain 
elements to be common in successive random drawings. 
It appears that the construction of such urn schemata 
will tend to give correlation a place in the elementary 
theory of probability. 

In a recent book*' by the Russian mathematician, 
A. A. Tschuprow, an important step has been taken to- 
ward connecting the regression method of dealing with 


correlation more closely with the theory of probability. 
This is accomplished by a consideration of the under-
lying definitions and concepts for a priori distributions.
It may be noted that we have not based our develop- 
ment of the regression method on a precise definition of 
correlation. Instead we have attempted a sort of genetic 
development. It may at this point be helpful in forming 
a proper notion of the scope and limitations of the regres- 
sion method to give a definition of correlation from the re- 
gression viewpoint. It seems that a general definition will 
involve probabilities because we shall almost surely wish 
to idealize actual distributions into theoretical distribu- 
tions or laws of frequency for purposes of definition. In a 
general sense, we may say that $y$ is correlated with $x$
whenever the theoretical distributions in arrays of $y$'s are
not identical for all possible assigned values of $x$, and we
say that $y$ is uncorrelated with $x$ whenever the theoretical
distributions in arrays of $y$'s are identical with each other
for all possible values of $x$. By the identity of the theo-
retical distributions in arrays of $y$'s, we mean that they
have equal means, standard deviations, and other par-
ameters required to characterize completely the distribu-
tions. It is fairly obvious that our discussion of the re- 
gression method is incomplete in a sense because we have 
not given a complete characterization of distributions in 
arrays. Our characterization of the statistical depend-
ence of $y$ on $x$ may be regarded as complete when the
arrays of y's are normal distributions, because the dis- 
tributions are then completely characterized by their 
arithmetic means and standard deviations. 



37. The normal correlation surfaces. The function

$z = f(x_1, x_2, \ldots, x_n)$

is called a frequency function of the $n$ variables $x_1, x_2, \ldots, x_n$ if

$z\,dx_1\,dx_2 \cdots dx_n$

gives, to within infinitesimals of higher order, the proba-
bility that a set of values of $x_1, x_2, \ldots, x_n$ taken at
random will lie in the infinitesimal region bounded by $x_1$
and $x_1 + dx_1$, $x_2$ and $x_2 + dx_2$, $\ldots$, $x_n$ and $x_n + dx_n$.
When the variables are not independent in the proba-
bility sense, the surface represented by $z = f(x_1, x_2, \ldots, x_n)$
is called a correlation surface.

With the notation of §29 for simple correlation, the
natural extension of the theory underlying the normal
frequency function of one variable to functions of two
variables $x$ and $y$ leads to the correlation surface

$z = \frac{1}{2\pi\sigma_x\sigma_y(1-r^2)^{1/2}}\, e^{-\frac{1}{2(1-r^2)}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y}\right)}.$

Moreover, with the notation of §33 on multiple corre-
lation, the natural extension to the case of a function of $n$
normally correlated variables $x_1, x_2, \ldots, x_n$ gives a
frequency function of the exponential type

$z = Ke^{-\phi/2}$,

where $\phi$ is a homogeneous quadratic function of the $n$
variables which may be written in the form

$\phi = \frac{1}{R}\left(\sum_{p} R_{pp}\frac{x_p^2}{\sigma_p^2} + 2\sum_{p<q} R_{pq}\frac{x_p x_q}{\sigma_p\sigma_q}\right),$

the determinant $R$ and its cofactors $R_{pp}$ and $R_{pq}$ being
defined in §33. We thus have a correlation surface in
space of $n+1$ dimensions.

For purposes of simplicity we shall limit our deriva- 
tions of normal frequency functions to functions of two 
and three variables thus restricting the geometry involved 
to space of three and four dimensions. 

The equation of the normal frequency surface may be 
derived from various sets of assumptions analogous to and 
extensions of sets of assumptions from which the normal 
frequency curve may be derived. Some of these deriva- 
tions make no explicit use of the fact that in normal 
correlation the regression is Knear. That is, linear re- 
gression is considered as a property of the frequency sur- 
face obtained from other assumptions. But we may con- 
nect the frequency-surface method closely with the re- 
gression method by involving linear regression of one of 
the variables on the others as one of the assumptions from 
which to derive the surface. This is the plan we shall 
adopt in the following derivation. Let us assume, first,
that one set of variates, say the $x$'s, are distributed
normally about their mean value taken as an origin.
Then, in our notation (p. 47 and §29),

(39)   $\dfrac{1}{\sigma_x\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma_x^2}}\, dx$,

to within infinitesimals of higher order, is the probability 
that an x taken at random will lie in the interval dx. 
Assume next that any array of $y$'s corresponding to an
assigned $x$ is a normal distribution with the standard
deviation of an array given by $\sigma_y(1-r^2)^{1/2}$ as found earlier
in this chapter (§31), and finally, assume that the re-
gression of $y$ on $x$ is linear. Then, in the notation of simple
correlation,

(40)   $\dfrac{1}{\sigma_y\sqrt{2\pi(1-r^2)}}\, e^{-\frac{\left(y - \frac{r\sigma_y}{\sigma_x}x\right)^2}{2\sigma_y^2(1-r^2)}}\, dy$

is, to within infinitesimals of higher order, the probability
that a $y$ taken at random from an assigned array of $y$'s
will lie in the interval $dy$.

By using the elementary principle that the probability
that both of two events will occur is equal to the product
of the probabilities that the first will occur and that the
second will occur when the first is known to have occurred,
we have the product $z\,dx\,dy$ of (39) and (40) for the proba-
bility, to within infinitesimals of higher order, that $x$ will
fall in $dx$ and the corresponding $y$ in $dy$, where

(41)   $z = \dfrac{1}{2\pi\sigma_x\sigma_y(1-r^2)^{1/2}}\, e^{-\frac{1}{2(1-r^2)}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y}\right)}$

is the normal correlation surface in three dimensions.
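The product of (39) and (40) can be verified numerically at a sample point: the marginal density of $x$ times the density of the $y$-array equals the surface (41) (a sketch using only the Python standard library; the parameter values are invented):

```python
import math

def marginal_x(x, sx):
    # Formula (39): normal density of x.
    return math.exp(-x * x / (2 * sx * sx)) / (sx * math.sqrt(2 * math.pi))

def array_y(y, x, sx, sy, r):
    # Formula (40): normal density of y in the array belonging to x,
    # centered on the regression line r*(sy/sx)*x with standard
    # deviation sy*sqrt(1 - r^2).
    s = sy * math.sqrt(1 - r * r)
    m = r * sy * x / sx
    return math.exp(-(y - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def surface(x, y, sx, sy, r):
    # Formula (41): the normal correlation surface.
    q = x * x / sx ** 2 + y * y / sy ** 2 - 2 * r * x * y / (sx * sy)
    return math.exp(-q / (2 * (1 - r * r))) / (
        2 * math.pi * sx * sy * math.sqrt(1 - r * r))

z1 = marginal_x(0.7, 1.2) * array_y(-0.4, 0.7, 1.2, 2.0, 0.6)
z2 = surface(0.7, -0.4, 1.2, 2.0, 0.6)
```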

Let us turn next to the derivation of the normal corre-
lation surface in four dimensions. Following the notation
of multiple correlation, we seek a normal frequency func-
tion

$z = f(x_1, x_2, x_3)$.

We shall assume first that pairs of the variates, say
of $x_2$'s and $x_3$'s, are normally distributed. Then by what
has just been demonstrated about the form of the correla-
tion surface in three dimensions, the expression

(42)   $\dfrac{1}{2\pi\sigma_2\sigma_3(1-r_{23}^2)^{1/2}}\, e^{-\frac{1}{2(1-r_{23}^2)}\left(\frac{x_2^2}{\sigma_2^2} + \frac{x_3^2}{\sigma_3^2} - \frac{2r_{23}x_2x_3}{\sigma_2\sigma_3}\right)}\, dx_2\, dx_3$


is, to within infinitesimals of higher order, the probability
that a point $(x_2, x_3)$ taken at random lies within the area
$dx_2\,dx_3$. We next assume that the regression of $x_1$ on $x_2$ and
$x_3$ is linear, and that each array of $x_1$'s corresponding to
an assigned $(x_2, x_3)$ is a normal distribution with standard
deviation

$\sigma_{1.23} = \sigma_1(R/R_{11})^{1/2}$

given by (32).

Then, in the notation of multiple correlation, the prob-
ability that a variate taken at random in an assigned
$(x_2, x_3)$-array of $x_1$'s will lie in $dx_1$ is given, to within infini-
tesimals of higher order, by

(43)   $\dfrac{1}{\sigma_{1.23}\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma_{1.23}^2}\left(x_1 + \frac{R_{12}}{R_{11}}\frac{\sigma_1}{\sigma_2}x_2 + \frac{R_{13}}{R_{11}}\frac{\sigma_1}{\sigma_3}x_3\right)^2}\, dx_1$.



Then the probability that a point $(x_1, x_2, x_3)$ taken at
random will lie in the volume $dx_1\,dx_2\,dx_3$ is given, to within
infinitesimals of higher order, by the product of (42) and
(43). This gives, after some simplification, for the proba-
bility in question, $z\,dx_1\,dx_2\,dx_3$, where

$z = \dfrac{1}{(2\pi)^{3/2} R^{1/2}\sigma_1\sigma_2\sigma_3}\, e^{-\frac{1}{2R}\left(R_{11}\frac{x_1^2}{\sigma_1^2} + R_{22}\frac{x_2^2}{\sigma_2^2} + R_{33}\frac{x_3^2}{\sigma_3^2} + 2R_{12}\frac{x_1x_2}{\sigma_1\sigma_2} + 2R_{13}\frac{x_1x_3}{\sigma_1\sigma_3} + 2R_{23}\frac{x_2x_3}{\sigma_2\sigma_3}\right)}.$


38. Certain properties of normally correlated dis-
tributions. The equal-frequency curves obtained by mak-
ing $z$ take constant values in equation (41) are an infinite
system of homothetic ellipses, any one of which has an
equation of the form

(44)   $\dfrac{x^2}{\sigma_x^2} + \dfrac{y^2}{\sigma_y^2} - \dfrac{2rxy}{\sigma_x\sigma_y} = \lambda^2$.

The area of the ellipse is

$\dfrac{\pi\sigma_x\sigma_y\lambda^2}{(1-r^2)^{1/2}}$,

and the semiaxes are given by $a = k\lambda$ and $b = k'\lambda$, where
$k$ and $k'$ are functions of $\sigma_x$, $\sigma_y$, and $r$. The probability
that a point $(x, y)$ taken at random will fall within any
ellipse obtained by assigning $\lambda$ is given by

(45)   $\dfrac{1}{1-r^2}\int_0^{\lambda} \lambda\, e^{-\frac{\lambda^2}{2(1-r^2)}}\, d\lambda = 1 - e^{-\frac{\lambda^2}{2(1-r^2)}}$.


Attention has often been called to the equal-frequency
ellipse known as the "probable" ellipse. The probable
ellipse may be defined as that ellipse of the system such
that the probability is 1/2 that a point $(x, y)$ of the
scatter-diagram (see Fig. 18, p. 109) lies within it. This
means by (45) that

$e^{-\frac{\lambda^2}{2(1-r^2)}} = \tfrac{1}{2}$, whence $\lambda^2 = 2(1-r^2)\log 2 = 1.3863\,(1-r^2)$.

From (45) it follows that $\dfrac{\lambda}{1-r^2}\, e^{-\frac{\lambda^2}{2(1-r^2)}}\, \Delta\lambda$ gives,
to within infinitesimals of higher order, the probability
that a point $(x, y)$ taken at random will fall in a small
ring obtained by taking values of $\lambda$ in $\Delta\lambda$.

We may determine the ellipse^^ along which, for a
given small ring $\Delta\lambda$, we should expect more points $(x, y)$
than along any other ellipse of the system. For a constant
$\Delta\lambda$, the probability is a maximum when $\lambda^2 = 1 - r^2$. Hence,
what may be called the ellipse of maximum probability is

$\dfrac{x^2}{\sigma_x^2} + \dfrac{y^2}{\sigma_y^2} - \dfrac{2rxy}{\sigma_x\sigma_y} = 1 - r^2$.


To illustrate the meaning of this ellipse, we may say
that in Bertrand's illustration of shooting a thousand
shots at a target, the probability is greater that a shot
will strike along this ellipse than along any other ellipse
of the system. It is an interesting fact that the ellipse
of maximum probability is identical with the orthogonal
projection of the parabolic points of the correlation surface
on the plane of distribution. To prove this theorem, we
simply find the locus of parabolic points on the surface
(41) by means of the well-known condition

$\dfrac{\partial^2 z}{\partial x^2}\,\dfrac{\partial^2 z}{\partial y^2} - \left(\dfrac{\partial^2 z}{\partial x\,\partial y}\right)^2 = 0$.

This gives

$\dfrac{x^2}{\sigma_x^2} + \dfrac{y^2}{\sigma_y^2} - \dfrac{2rxy}{\sigma_x\sigma_y} = 1 - r^2$,

which establishes the theorem.

By comparing $\lambda^2 = 1 - r^2$ with $\lambda^2 = 1.3863\,(1-r^2)$, we
note that the probable ellipse is larger than the ellipse of
maximum probability. For the statures of 1,078 husbands
and wives, the two ellipses just discussed are shown on the
scatter-diagram in Figure 18. By actual count from the
drawing (Fig. 18), it turns out that 536 of the 1,078 points
are within the probable ellipse and 412 are within the
ellipse of maximum probability. These numbers differ
from the theoretical values by amounts well within what
should be expected as chance fluctuations.
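These counts can be imitated by simulation: for points drawn from the surface (41), the fractions falling within the probable ellipse and within the ellipse of maximum probability should approach 1/2 and $1 - e^{-1/2} \approx 0.3935$ by (45) (a sketch assuming numpy; $r$ and the sample size are invented values):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200000
r, sx, sy = 0.5, 1.0, 1.0

# Draw (x, y) from the normal correlation surface: x normal, then y
# normal about the regression line with std. dev. sy*sqrt(1 - r^2).
x = sx * rng.standard_normal(n)
y = r * sy * x / sx + sy * np.sqrt(1 - r ** 2) * rng.standard_normal(n)

# The quadratic form whose level curves are the equal-frequency ellipses.
q = x ** 2 / sx ** 2 + y ** 2 / sy ** 2 - 2 * r * x * y / (sx * sy)

frac_probable = np.mean(q <= 2 * (1 - r ** 2) * np.log(2))  # expect ~ 1/2
frac_maxprob = np.mean(q <= 1 - r ** 2)                     # expect ~ 0.3935
```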

Another interesting problem in connection with the
correlation surface relates to the determination of the
locus along which the frequency or density of points on
the plane of distribution (scatter-diagram) bears a simple
relation to the corresponding density under independence.
Thus, we seek the curve along which dots of the scatter-
diagram are $k$ times as frequent as they would be under
independence, where $k$ is a constant. Equating $z$ in (41)
to $k$ times the corresponding value of $z$ when $r = 0$ in (41),
we obtain after slight simplification the hyperbola
(Fig. 18)

(46)   $\dfrac{rx^2}{\sigma_x^2} - \dfrac{2xy}{\sigma_x\sigma_y} + \dfrac{ry^2}{\sigma_y^2} = -\dfrac{2(1-r^2)}{r}\,\log\big[k(1-r^2)^{1/2}\big]$.
Karl Pearson dealt*^ with this curve for $k = 1$. That is,
he considered the locus along which the density of points
of the scatter-diagram is the same as it would have been
under independence. The fact that the density of dis-
tribution at the centroid in (41) is $1/(1-r^2)^{1/2}$ times as
much as it would be under independence naturally sug-
gests the study of the locus of all points for which
$k = 1/(1-r^2)^{1/2}$ in (46). It turns out that in this case
the hyperbola degenerates into the straight lines

$\dfrac{y}{\sigma_y} = \dfrac{1 \pm (1-r^2)^{1/2}}{r}\,\dfrac{x}{\sigma_x}$.

These lines are shown as lines AB and CD on Figure 18.
They separate the plane of distribution into four com-
partments such that one-fourth is the probability that a
pair of values $(x, y)$ taken at random will give a point
falling into any prescribed one of these compartments.
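The statement that each compartment receives probability one-fourth can likewise be checked by simulation, using the degenerate lines in the standardized form $y/\sigma_y = [(1 \pm (1-r^2)^{1/2})/r](x/\sigma_x)$ (a sketch assuming numpy; $r$ and the sample size are invented values):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400000
r = 0.6

# Standardized coordinates a = x/sigma_x, b = y/sigma_y.
a = rng.standard_normal(n)
b = r * a + np.sqrt(1 - r ** 2) * rng.standard_normal(n)

# Slopes of the two degenerate lines through the origin.
m_plus = (1 + np.sqrt(1 - r ** 2)) / r
m_minus = (1 - np.sqrt(1 - r ** 2)) / r

above_plus = b > m_plus * a
above_minus = b > m_minus * a
fractions = [np.mean(above_plus & above_minus),
             np.mean(above_plus & ~above_minus),
             np.mean(~above_plus & above_minus),
             np.mean(~above_plus & ~above_minus)]
```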
Although no further discussion of the properties of 
normal correlation surfaces will be attempted in this 
monograph, certain properties analogous to those men- 
tioned for the surface in three dimensions would follow 
rather readily in the case of the surfaces in higher dimen- 
sions. Thus the system of ellipsoids of equal frequencies 
has been studied to some extent."*^ In a paper by James 
McMahon,^® the connection between the geometry of the 
hypersphere and the theory of normal frequency func- 
tions of n variables is established by linearly transforming 
the hyperellipsoids of equal frequency into a family of 
hyperspherical surfaces, and by applying the formulas of 
hyperspherical goniometry to obtain theorems in multiple 
and partial correlations. 


39. Remarks on further methods of characterizing 
correlation. In bringing to a conclusion our discussion of 
correlation, it may be of interest to point out a few of the 
limitations and omissions in our treatment, and to give 
certain references that would facilitate further reading. 

We have not even touched on the methods of dealing 
with correlation of characters which do not seem to admit 
of exact measurement, but admit of classification; for 
example, eye color, hair color, and temperament may be 
regarded as such characters. Such characters are some- 
times called qualitative characters to distinguish them 
from quantitative characters. The correlation between 
two such characters has been dealt with in some cases by 
the method of tetrachoric^^ correlation, in other cases by 
the method of contingency,^^ and by the method of cor- 
relation in ranks^^ in cases where the items are ordered 
but not measured. We have not touched on the methods 
of dealing with correlation''^ in time series — a subject of 
much importance in the methodology of economic statis- 
tics. The methods and theories of connection and con- 
cordance of Gini^° for dealing with correlation have been 
omitted. No discussion has been given of the fundamental 
work of Bachelier^i on correlation theory in his treatment 
of continuous probabilities of two or more variables. Our 
discussion of frequency surfaces in §37 is limited to normal 
correlation surfaces. The way is, however, fairly clear for 
the extension^^ of the Gram-Charlier system of representa- 
tion to distributions of two or more variables which are 
not normally distributed. 

While great difficulties have been encountered in the 
past thirty years in attempts to pass naturally from the 
Pearson system of generalized frequency curves to analo- 


gous surfaces for the characterization of the distribution 
of two correlated variables, it is of considerable interest 
to remark that substantial progress has been made re- 
cently on the solution of this problem by Narumi,^^ Pear- 
son," and Camp.^ 

Although the many omissions make it fairly obvious 
that our discussion is not at all complete, it is hoped that 
enough has been said about the theory of correlation to 
indicate that this theory may be properly considered as 
constituting an extensive branch in the methodology of 
science that should be further improved and extended. 


40. Introduction. In Chapter II we have dealt to some 
extent with the effects of random sampling fluctuations 
on relative frequencies. But it is fairly obvious that the 
interest of the statistician in the effects of sampling fluc- 
tuations extends far beyond the fluctuations in relative 
frequencies. To illustrate, suppose we calculate any sta- 
tistical measure such as an arithmetic mean, median, 
standard deviation, correlation coefficient, or parameter 
of a frequency function from the actual frequencies given 
by a sample of data. If we need then either to form a 
judgment as to the stability of such results from sample 
to sample or to use the results in drawing inferences 
about the sampled population, the common-sense process 
of induction involved is much aided by a knowledge of 
the general order of magnitude of the sampling discrep- 
ancies which may reasonably be expected because of the 
limited size of the sample from which we have calculated 
our statistical measures. 

We may very easily illustrate the nature of the more 
common problems of sampling by considering the deter- 
mination of certain characteristics of a race of men. For 
example, suppose we wish to describe any character such 
as height, weight, or other measurable attributes among 
the white males age 30 in the race. We should almost 
surely attempt merely to construct our science on the 
basis of results obtained from the sample. Then the ques- 



tion arises: What is an adequate sample for a particular
purpose? The theory of sampling throws some light on 
this question. The development of the elements of a the- 
ory of sampling fluctuations in various averages, coeffi- 
cients, and parameters is thus of fundamental importance 
in regarding the results obtained from a sample as ap- 
proximate representatives of the results that would be
obtained if the whole indefinitely large population were
measured.
One of the difficult and practical questions involved 
in making statistical inquiries by sample relates to the 
invention of satisfactory devices for obtaining a random 
sample at the source of material. A result obtained from 
a sample unless taken with great care may diverge signifi- 
cantly from the true value characteristic of the sampled 
population. For example, the writer had an experience
in attempting to pick up a thousand ears of Indian corn
at random with respect to size of ears. It soon appeared
fairly obvious that instinctively one tended to make
"runs" on ears of approximately the same size. The sam-
ple would probably not be taken at random when thus 
drawn. Such systematic divergence from conditions nec- 
essary for obtaining a random sample is assumed to be 
eliminated before the results that follow from the theory 
of random sampling fluctuations are applicable. In the 
practical applications of sampling theory, it is thus im- 
portant to remember that the conditions for random 
sampling at the source of data are not always easily ful- 
filled. In fact, it seems important in certain investigations 
to devise special schemes for obtaining a random sample. 
For example, we may sometimes improve the conditions 
for drawing a random sample of individuals by the use 


of a ball or card bearing the number of each individual 
of a much larger aggregate than the sample we propose 
to measure and by then drawing the sample by lot from 
such a collection of balls or cards after they have been 
thoroughly mixed. Even with urn schemata containing 
white and black balls thoroughly mixed, it must be as- 
sumed further that one kind of balls is not more slippery 
than another if slippery balls evade being drawn. The 
appropriate devices for obtaining a random sample de- 
pend almost entirely on the nature of the particular field 
of inquiry, and we shall in the following discussion simply 
assume that random samples can be drawn. 
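The drawing-by-lot scheme described above is, in modern terms, sampling without replacement from a numbered aggregate (a minimal sketch using the Python standard library; the sizes are invented):

```python
import random

random.seed(5)

# Number every individual of the larger aggregate, as with balls or cards.
aggregate = list(range(1, 10001))        # an aggregate of 10,000 individuals

# Draw the sample by lot: each individual equally likely, no repeats.
sample = random.sample(aggregate, 1000)  # a random sample of 1,000
```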

In an inquiry by sample, the following fundamental
question comes up very naturally about any result, say
a mean value $\bar{x}$, to be obtained from a sample of $s$ indi-
viduals: What is the probability that $\bar{x}$ will deviate
numerically by not more than an assigned positive number $\delta$
from the corresponding unknown true value that would
be given by using an unlimited supply of the material
from which the $s$ variates are drawn? This question pre-
sents difficulties. An ideal answer is not available, but 
valuable estimates of the probability called for in this 
question may be made under certain conditions by a 
procedure which involves finding the standard deviation 
of random sampling deviations. 

For the unknown true value referred to above, con-
tinental European writers very generally use the mathe-
matical expectation or the expected value of the variable
(cf. §6). In what follows, we shall to some extent adopt
this practice and shall find it convenient to assume the 
following propositions without taking the space to demon- 
strate them: 


I. The expected value E[x − E(x)] of deviations of a
variable from its expected value E(x) is zero.

II. The expected value of the sum of two variables is
the sum of their expected values. That is, E(x + y) =
E(x) + E(y).

III. The expected value of the product of a constant
and a variable is equal to the product of the constant by
the expected value of the variable. That is, E(cx) = cE(x).

IV. The expected value of the product xy of corresponding
values of two mutually independent variables
x and y is equal to the product of their expected values,
where we call x and y mutually independent if the law of
distribution of each of them remains the same whatever
values are assigned to the other.

V. In particular, if x and y are corresponding devia- 
tions of two mutually independent variables from their 
expected values, the expected value of the product xy is 
zero. It is fairly obvious that V follows from I and IV. 

It is convenient in the discussion of random sampling 
fluctuations to deal with the problem of the distribution 
of results from samples of equal size. To give a simple 
example, let us conceive of taking a random sample
consisting of 1,000 men of a well-defined race in which some
character is measured giving us 1,000 variates. Next, sup- 
pose we repeat the process until we have 1,000 such sam- 
ples of 1,000 men in each sample. Then each of the sam- 
ples would have its own arithmetic mean, median, mode, 
standard deviation, moments, and so on. Consider next 
the 1,000 results of a given kind, say the 1,000 arithmetic 
means from the samples. They would almost surely differ 
but slightly from one another in comparison with differ- 
ences between extreme individual variates. But if the 


measurements are reasonably accurate the means would 
differ and form a frequency distribution. This frequency 
distribution of means would have its own mean (mean of 
means) and its own standard deviation. We are especially 
interested in such a standard deviation, for it may be 
taken as an approximate measure of the variability or
dispersion of means obtained from different samples. This 
standard deviation (standard error) would no doubt be a 
fairly satisfactory measure of sampling fluctuations for 
certain purposes. 

Although the process of finding mean values from each 
of a large number of equal samples with a large number 
of individuals in each sample gives us a useful conception 
of the problem of sampling errors in mean values, it would 
ordinarily be a laborious and usually an impractical task 
because of paucity of available data to carry out such a 
set of calculations. The statistician ordinarily obtains a 
result from a sample by calculation, say a mean value x̄,
and then investigates the standard deviation of such re- 
sults without taking further samples. That such a treat- 
ment of the problem is possible is clearly an important 
mathematical achievement. 

The space available in the present monograph will not 
permit the derivation of formulas for the standard devia- 
tion of sampling errors in many types of averages or 
parameters. In fact, we shall limit ourselves to presenting 
only sufficient derivations of such formulas to indicate the 
nature of the main assumptions and approximations in- 
volved in the rationale which supports such formulas, and 
certain of their interpretations. Preliminary to deriving 
formulas for standard deviations of sampling errors in 
certain averages and parameters, we need to find the 


standard deviation and correlation of errors in class
frequencies of any given frequency distribution. For brevity
we shall use the expression "standard error" in place of 
"standard deviation of errors." 

41. Standard error and correlation of errors in class 
frequencies. Suppose we obtain from a random sample of 
a population an observed frequency distribution 

f_1, f_2, …, f_t, …, f_n

with a number f_t of individuals in a class t, and with a
total of f_1 + f_2 + ⋯ + f_n = s individuals observed in the sample.

Suppose next that we should obtain a large number of
such samples of s observations each taken under the same
essential conditions. A class frequency f_t will vary from
sample to sample. These values of f_t will form a frequency
distribution. We set the problem of expressing the expected
value of the square of the standard deviation σ_{f_t}
in terms of observed values.

To solve this problem, we may consider that any ob- 
servation to be made is a trial, and that it is a success to 
obtain an observation for which the individual falls in 
the class t. Let p_t be the probability of success in one
trial, and q_t = 1 − p_t be the corresponding probability of failure.

In sets of s trials with a constant probability p_t of
obtaining an individual in the class t, we have from page
27 that the square of the standard deviation of f_t in the
theoretical distribution is given by

(1) σ_{f_t}^2 = s p_t q_t = s p_t (1 − p_t) .


In statistical applications, we do not ordinarily know
the exact value of p_t, but accept the relative frequency
f_t/s as an approximation to p_t if s is large. If we thus
accept f_t/s as an approximation to p_t, and substitute
p_t = f_t/s in (1), we obtain

(2) σ_{f_t}^2 = f_t (1 − f_t/s)

as an approximate value of the square of σ_{f_t} conveniently
expressed in terms of observed frequencies.

The value (2) is regarded as an appropriate approximation
to the value of (1) because (1) may be obtained
from (2) by replacing the quotient f_t/s by its expected
value p_t. It is usually agreed among statisticians, however,
that a better approximation to (1) would be an expression
which as a whole has the second member of (1)
as its expected value. The expected value of the product
f_t(1 − f_t/s) is not the product s p_t(1 − p_t) of the expected
values of its factors, as we shall see in the next paragraph.
It will be found that the second member of the equation

(3) σ_{f_t}^2 = (s/(s − 1)) f_t (1 − f_t/s)

has s p_t(1 − p_t) as its expected value, and (3) is therefore
regarded as a better approximation than (2) for expressing
(1) in terms of observed frequencies. The reason for
the advantage of formula (3) over formula (2) is the subject
of frequent inquiries by students of statistics, and
it is hoped that the discussion here given will contribute
to answering such inquiries.

In accordance with the principle just stated it will
be seen that the error introduced by replacing s p_t(1 − p_t)
by f_t(1 − f_t/s) involves not only sampling errors, but also
a certain systematic error. Thus, although the expected
value of f_t is s p_t (p. 26) and the expected value of 1 − f_t/s
is 1 − p_t, we shall see as stated above that the expected
value of the product f_t(1 − f_t/s) is not equal to the product
s p_t(1 − p_t) of the expected values, but is in fact equal to
(s − 1) p_t(1 − p_t). We may prove this by first expressing
(1), with the help of the definition of σ_{f_t}^2, in the form

(4) E[(f_t − s p_t)^2] = s p_t (1 − p_t) ,

and then applying the last proposition on page 21 which
states that the expected value of the square of the variable
x is equal to the square of the expected value of x
increased by the expected value of the square of the deviations
of x from its expected value. Thus, for a variable
x = f_t with an expected value s p_t, we write

E(f_t^2) = s^2 p_t^2 + E[(f_t − s p_t)^2] = s^2 p_t^2 + s p_t (1 − p_t)

from (4). Further,

(5) E[f_t(1 − f_t/s)] = E(f_t) − (1/s) E(f_t^2)
= s p_t − s p_t^2 − p_t(1 − p_t) = (s − 1) p_t (1 − p_t) .

By multiplying both members of (5) by s/(s − 1), we may
conclude that s f_t(1 − f_t/s)/(s − 1) has the expected
value s p_t(1 − p_t).

Thus, in approximating to the value s p_t(1 − p_t) in the
right member of (1) by means of a function of the observed
f_t, we note that the function s f_t(1 − f_t/s)/(s − 1)
has the expected value s p_t(1 − p_t) which we seek, and that
f_t(1 − f_t/s) given in the right member of (2) as an approximation
to s p_t(1 − p_t) contains a systematic error.

In finding standard errors in means, moments, correlation
coefficients, and so on, it is important to know the
correlation between deviations of frequencies in any two
classes. Let δf_t be the deviation of f_t from the theoretical
mean or expected value of the class frequency in taking
a random sample of s variates. Then since f_1 + f_2 + ⋯
+ f_t + ⋯ + f_n = s, a constant, we have

(6) δf_1 + δf_2 + ⋯ + δf_t + ⋯ + δf_n = 0 .

If our sample has given δf_t more than the expected
number in the class t, it may reasonably be assumed that
a deficiency equal to −δf_t will tend to be distributed
among the other groups in proportion to their expected
relative frequencies.

Now suppose we had a correlation table made of pairs
of values of δf_t and δf_{t'} obtained from a large number of
samples. Consider the array in which δf_t has a fixed value.
By (6), for each sample,

−δf_t = δf_1 + δf_2 + ⋯ + δf_{t−1} + δf_{t+1} + ⋯ + δf_n .

Assume that the amount of frequency in the left member
of this equality is distributed to terms of the right
member in such proportion that, for a fixed δf_t, the mean
value of δf_{t'} is

(7) −(p_{t'}/(1 − p_t)) δf_t .

This gives the mean of the array under consideration.


It is fairly obvious that the correlation coefficient r_{xy}
of N pairs of deviations x and y from mean values is
equal to

r_{xy} = (1/(N σ_x σ_y)) Σ N_x x ȳ_x ,

where ȳ_x is the mean of the x-array of y's, and N_x is the
number in the array. Then

r_{xy} σ_x σ_y = mean value of xy = (1/N) Σ N_x x ȳ_x .

By attaching this meaning to the correlation coefficient
r_{f_t f_{t'}} of δf_t and δf_{t'}, and using (7) for the mean
of the array, we have

r_{f_t f_{t'}} σ_{f_t} σ_{f_{t'}} = mean value of −(p_{t'}/(1 − p_t)) δf_t^2

= (−p_{t'}/(1 − p_t)) (mean value of δf_t^2) = (−p_{t'}/(1 − p_t)) σ_{f_t}^2

(8) = −s p_t p_{t'} from (1)

(9) = −f_t f_{t'}/s

as a first approximation.

A systematic error is involved in replacing s p_t p_{t'} by
f_t f_{t'}/s on account of the correlation between f_t and f_{t'}.

To deal with the effect of this correlation, we may first
write (3), page 83, in the form

(1/N) Σ_{i=1}^{N} x_i y_i = x̄ ȳ + r σ_x σ_y .

If we are dealing with a population or theoretical distribution
rather than with a sample, this formula gives
us the proposition that the expected value of the product,
x_i y_i, of pairs of variables is equal to the product, x̄ ȳ, of
their expected values increased by the product, r σ_x σ_y, of
the correlation coefficient and the two standard deviations.

To apply this proposition when x_i = f_t and y_i = f_{t'}, we
note from (8) that, for the population, r σ_x σ_y = −s p_t p_{t'},
and recall that E(f_t) = s p_t and E(f_{t'}) = s p_{t'}. Then the
proposition stated above gives us

E(f_t f_{t'}) = s^2 p_t p_{t'} − s p_t p_{t'} ,

(10) E(f_t f_{t'}/s) = (1/s) E(f_t f_{t'}) = (s − 1) p_t p_{t'} .

To obtain the right member of (8) as accurately as
possible in terms of the observed f_t and f_{t'}, we multiply
both members of (10) by s/(s − 1) and then note that
f_t f_{t'}/(s − 1) has the expected value s p_t p_{t'}. In the right
member of (9), the value f_t f_{t'}/s used as an approximation
to s p_t p_{t'} thus contains a certain systematic error. To
eliminate the systematic error from (9), we write

(11) r_{f_t f_{t'}} σ_{f_t} σ_{f_{t'}} = −f_t f_{t'}/(s − 1)

in place of (9) as a second approximation to (8).
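The negative correlation asserted in (8), and the unbiasedness claim behind (10), can both be checked by simulation. The sketch below is a modern illustration with arbitrary class probabilities and s = 40; it verifies that the covariance of two class frequencies is near −s p_t p_{t'} and that f_t f_{t'}/(s − 1) averages to s p_t p_{t'}:

```python
import random

def class_freqs(probs, s, rng):
    """Counts in each class after s independent trials (one multinomial sample)."""
    counts = [0] * len(probs)
    for _ in range(s):
        u, acc = rng.random(), 0.0
        k = len(probs) - 1          # default to last class guards float rounding
        for i, p in enumerate(probs):
            acc += p
            if u < acc:
                k = i
                break
        counts[k] += 1
    return counts

rng = random.Random(3)
probs, s, n = [0.2, 0.5, 0.3], 40, 20000
sum_f1 = sum_f2 = sum_ff = 0.0
for _ in range(n):
    f = class_freqs(probs, s, rng)
    sum_f1 += f[0]
    sum_f2 += f[1]
    sum_ff += f[0] * f[1]
cov = sum_ff / n - (sum_f1 / n) * (sum_f2 / n)   # near -s*p1*p2 = -4, as in (8)
unbiased = sum_ff / n / (s - 1)                  # near  s*p1*p2 =  4, as in (10)
```

The excess of one class frequency is compensated by deficiencies in the others, which is the source of the negative sign.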


42. Remarks on the assumptions involved in the derivation
of standard errors. The three outstanding assumptions
that should probably be emphasized in considering
the validity and the limitations of the results (2) and (9)
are (a) that the probability that a variate taken at
random will fall into any assigned class remains constant,
(b) that the number s is so large that we obtain certain
valuable approximations by using the relative frequency
f_t/s in place of the probability p_t that a variate taken at
random will fall into the class t, and (c) that any sampling
deviation δf_t from the expected value of a class frequency
is accompanied by an apportionment of −δf_t to other
class frequencies in amounts proportional to the expected
values of such other class frequencies. Our use of assumption
(b) involves more than is apparent on the surface,
because in its use we not only replace a single isolated
probability p_t by a corresponding relative frequency f_t/s,
but we further assume the liberty of using certain functions
of the relative frequencies in place of these functions
of the corresponding probabilities or expected values.
This procedure may lead to certain systematic errors in
addition to the sampling errors. For example, we have,
in obtaining (2), used the function f_t(1 − f_t/s) of f_t/s in
place of the same function s p_t(1 − p_t) of the expected
value p_t, and have by this procedure tended to underestimate
the expected value when s is finite. That is,
s f_t(1 − f_t/s)/(s − 1) and not f_t(1 − f_t/s) is our best estimate
of the expected value. However, when s becomes
large, f_t(1 − f_t/s) is a valuable first approximation to the
expected value.

The rule that the expected value of a function may be 
taken as approximately equal to the function of the ex- 


pected value has been much used by statisticians in a 
rather loose and uncritical manner. A critical study of the 
application and limitations of this rule was published by 
Bohlmann in 1913. While it is beyond the scope of this
monograph to enter upon a general discussion of Bohl- 
mann's conclusions, it is of special interest for our purpose 
that the application of the rule leads at least to first ap- 
proximations when the functions in question are algebraic 
functions. Although it may seem that we have in the 
derivation of (2) and (9) taken the liberty to substitute 
relative frequencies rather freely in place of the proba- 
bilities required in an exact theory, this procedure may be 
extended to any algebraic functions when the number s 
is very large, with the expectation of obtaining useful 
approximations. Since certain derivations which follow 
make use of (2) and (9), the resulting formulas involve the 
weaknesses and limitations of the above assumptions. 

43. Standard error in the arithmetic mean and in a
qth moment coefficient about a fixed point. For the arithmetic
mean of s observed values of a variable x we write

x̄ = (1/s) Σ f_t x_t ,

where f_t is the class frequency of x_t.

Suppose the s values constitute a random sample of 
observations on the variable x. Suppose further that we 
continue taking observations on x until we have a very 
large number of random samples each consisting of s ob- 
served values. Then assume that there exists an expected 
value of each f_t about which the observed f_t's exhibit
dispersion, and that corresponding to these expected val- 


ues there exists a theoretical mean value x̃ of x about
which the x̄'s calculated from samples of s exhibit dispersion.
Using δf and δx̄ to denote deviations in any sample
from the expected values of f and x̄, respectively, we have

s δx̄ = Σ x_t δf_t ,

s^2 (δx̄)^2 = Σ(x_t^2 δf_t^2) + 2 Σ'(x_t x_{t'} δf_t δf_{t'}) ,

where the sum Σ extends from t = 1 to t = n, and Σ' is the
sum for all values of t and t' for which t ≠ t'.

Next, sum both members of this equality for all samples
and divide by the number of samples. This gives, in
the notation for standard deviations (p. 119) and for the
correlation coefficient (p. 123),

s^2 σ_{x̄}^2 = Σ(x_t^2 σ_{f_t}^2) + 2 Σ'(x_t x_{t'} σ_{f_t} σ_{f_{t'}} r_{f_t f_{t'}}) .

By using (1) and (8), we have

s σ_{x̄}^2 = Σ(x_t^2 p_t) − Σ(x_t^2 p_t^2) − 2 Σ'(x_t x_{t'} p_t p_{t'})
= μ_2' − (Σ x_t p_t)^2 = μ_2' − x̃^2 = σ^2 ,

where σ is the standard deviation of the theoretical distribution.
Then

(12) σ_{x̄}^2 = σ^2/s .

Instead of the σ of the theoretical distribution, we
ordinarily use the σ obtained from a sample. To introduce
the expected value of σ^2 from the sample, we may, for a
first approximation, use (2) and (9) in place of (1) and
(8) above, and obtain very simply a form identical with (12).

As a second approximation, we may use (3) and (11)
in place of (1) and (8) above, and obtain very simply

(13) σ_{x̄}^2 = σ^2/(s − 1) and σ_{x̄} = σ/(s − 1)^{1/2} ,

where σ is to be obtained from the sample.

The distinction between the expected value of σ^2 from
the population and from the sample involves a rather
delicate point, but one that has been long recognized in
the literature of error theory. The distinction has been
rather generally ignored in books on statistics. In numerical
problems, the differences in the results of formulas
(12) and (13) are negligible when s is large.
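Formula (12) lends itself to direct experiment. A minimal sketch (a modern illustration; the normal population and the values s = 25, σ = 1 are arbitrary choices) draws many samples of s, computes the mean of each, and compares the dispersion of those means with σ/s^{1/2}:

```python
import random, math

rng = random.Random(11)
s, n_samples = 25, 4000
pop_sigma = 1.0                       # population taken as standard normal
means = []
for _ in range(n_samples):
    sample = [rng.gauss(0.0, pop_sigma) for _ in range(s)]
    means.append(sum(sample) / s)     # arithmetic mean of one sample
grand = sum(means) / n_samples
sd_of_means = math.sqrt(sum((m - grand) ** 2 for m in means) / n_samples)
predicted = pop_sigma / math.sqrt(s)  # formula (12): sigma / s^(1/2)
```

The standard deviation of the 4,000 means agrees closely with the predicted value 0.2.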

The standard deviation (standard error) may well 
serve as a measure of sampling fluctuations. But custom 
has not established the direct use of the standard error 
to any considerable extent. The so-called probable error
has come into much more common use than the standard 
error. The probable error E is sometimes defined very sim- 
ply as .6745 times the standard error without regard to the 
nature of the distribution. This definition of the probable 
error does not impose the condition that the distribution 
of results obtained on repetition shall necessarily be a
normal distribution. But with such a definition of prob- 
able error, the real difficulty is not overcome, but merely 
shifted to the point where we attempt an interpretation 
of the probable error in terms of the odds in favor of or 
against an observed result obtained from a sample falling 
within an assigned deviation of the true value. 

Thus, in the derivation of (12) we have obtained, subject
to certain important limitations, the standard deviation
of means x̄ obtained from samples of s about a theoretical
mean value x̃ which may ordinarily be regarded
as a sort of a true value of the mean. If the distribution
of x̄'s obtained from samples about such a true value is
assumed to be a normal distribution, we may by the use
of the table of the probability integral state at once that
the odds are even that an x̄ obtained from a sample will
differ numerically from the true value by not more than 

E = .6745 (standard error) .

It is the assumption of a normal distribution of the means 
from samples combined with the specification of an even 
wager that brings the multiplier .6745 into the problem. 

We may further expedite the treatment of sampling 
errors by finding the odds in favor of or against an ob- 
served deviation from the true value not exceeding nu- 
merically a certain multiple of E, say tE. As t increases 
to 5, 6, or more, the odds in favor of obtaining a deviation 
smaller than tE are so large as to make it practically 
certain that we will obtain such a smaller deviation. 
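The odds attached to a multiple of the probable error follow directly from the probability integral. A sketch of the computation in modern notation (`math.erf` supplies the probability integral; the multiplier .6745 is the one quoted above):

```python
import math

def prob_within(t):
    """Probability that a normally distributed result falls within
    t probable errors (t * .6745 standard errors) of the true value."""
    z = t * 0.6745
    return math.erf(z / math.sqrt(2))

half = prob_within(1)          # even odds, by definition of the probable error
near_certain = prob_within(5)  # deviations beyond 5E are very improbable
```

For t = 1 the probability is one-half, confirming the even wager; for t = 5 the odds in favor already exceed a thousand to one.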

We have discussed briefly the meaning and limitations 
of probable errors. The most outstanding limitation on 
the interpretation of probable errors is the requirement 
of a normal distribution of the statistical constant under 
consideration. We have to a considerable extent used the 
arithmetic mean as an illustration, but the same general 
requirements about the normality of the distribution 
would clearly apply, whatever the statistical constant. 

We shall consider next the standard error in a qth
moment coefficient μ_q' about a fixed point. By definition,

μ_q' = (1/s) Σ f_t x_t^q .

For the relation between deviations from theoretical values
we have

s δμ_q' = Σ x_t^q δf_t ,

s^2 (δμ_q')^2 = Σ(x_t^{2q} δf_t^2) + 2 Σ'(x_t^q x_{t'}^q δf_t δf_{t'}) .

Sum both members of this equality for a large number
of samples N and divide by N. This gives in the notation
for standard deviations (p. 119) and for the correlation
coefficient (p. 123)

s^2 σ_{μ_q'}^2 = Σ(x_t^{2q} σ_{f_t}^2) + 2 Σ'(x_t^q x_{t'}^q σ_{f_t} σ_{f_{t'}} r_{f_t f_{t'}}) .

Using (1) and (8), we have

s σ_{μ_q'}^2 = Σ(x_t^{2q} p_t) − Σ(x_t^{2q} p_t^2) − 2 Σ'(x_t^q x_{t'}^q p_t p_{t'})
= μ_{2q}' − (Σ x_t^q p_t)^2 = μ_{2q}' − μ_q'^2 ,

so that

(14) σ_{μ_q'}^2 = (μ_{2q}' − μ_q'^2)/s ,

where the moments in the right-hand member relate to 
the theoretical distribution. By methods analogous to 
those used in the case of the arithmetic mean (pp. 127- 
28), we may pass to moments which relate to the sample. 
The probable error of μ_q' is then E = .6745 σ_{μ_q'}, and
the usual interpretation of such a probable error by means
of odds in favor of or against deviations less than a multiple
of E is again dependent on the assumption that the
qth moments μ_q' found from repeated trials form a
normal distribution.

44. Standard error of the qth moment μ_q about a
mean. In considering the problem of the standard error
of a moment about a mean, it is important to recognize
the difference between the mean of the population and a
mean obtained from a sample.

For simplicity, we shall consider the problem of the
standard error in a qth moment about the mean of the
population when we take samples of s variates as in § 43.
The mean of the population is a fixed point about which
we take the qth moment of each sample of s variates.
Then if we follow the usual plan of dropping the primes
from the μ's to denote moments about a mean, we write
from (14)

σ_{μ_q}^2 = (μ_{2q} − μ_q^2)/s

for the square of the standard error of μ_q in terms of
moments of the theoretical distribution.

In particular, we have for the standard error of the
second moment

σ_{μ_2}^2 = (μ_4 − μ_2^2)/s .

When the distribution is normal,

μ_4 = 3 μ_2^2 , and σ_{μ_2}^2 = 2 μ_2^2/s .

Since σ = (μ_2)^{1/2}, we have

δμ_2 = 2σ δσ .

Square each member, sum for all samples, and divide
by the number of samples. This gives

4 σ^2 σ_σ^2 = σ_{μ_2}^2 = 2 σ^4/s , or σ_σ = σ/(2s)^{1/2} .

Hence, the probable error in approximating to the
standard deviation σ of the population by the standard
deviation from a sample of s variates is given approximately
by

.6745 σ_σ = .6745 σ/(2s)^{1/2} .
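The result σ_σ = σ/(2s)^{1/2} can be illustrated by simulation. The sketch below uses arbitrary choices σ = 2 and s = 100; the standard deviation of each sample is here taken about the sample mean, which for s this large differs negligibly from taking it about the population mean:

```python
import random, math

rng = random.Random(5)
s, n_samples, sigma = 100, 3000, 2.0
sds = []
for _ in range(n_samples):
    x = [rng.gauss(0.0, sigma) for _ in range(s)]
    m = sum(x) / s
    sds.append(math.sqrt(sum((v - m) ** 2 for v in x) / s))  # sample s.d.
mean_sd = sum(sds) / n_samples
spread = math.sqrt(sum((v - mean_sd) ** 2 for v in sds) / n_samples)
predicted = sigma / math.sqrt(2 * s)      # sigma / (2s)^(1/2)
```

The dispersion of the 3,000 sample standard deviations falls close to the predicted value 2/√200 ≈ 0.141.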

To avoid misunderstanding, it should perhaps be emphasized
that we have throughout this section restricted
our discussion to the qth moment about the mean of the
population. The problem of dealing with the standard
error in the qth moment about the mean of a sample
offers additional difficulties because such a mean varies
from sample to sample. A problem arises from the correlation
of errors in the means and in the corresponding
moments. Further problems arise in considering the closeness
of certain approximations, especially when the moments
are of fairly high order, that is, when q is large.
We shall simply state without demonstration that the
square of the standard error in the qth moment about
the mean of a sample is given by

σ_{μ_q}^2 = (μ_{2q} − μ_q^2 − 2q μ_{q−1} μ_{q+1} + q^2 μ_2 μ_{q−1}^2)/s

as a first approximation. For q = 2, this expression becomes
(μ_4 − μ_2^2)/s. For q = 4, it becomes (μ_8 − μ_4^2)/s in the
case of a normal distribution. These expressions for the
special cases q = 2 and q = 4 are the same as for the moments
about a fixed point.


45. Remarks on the standard errors of various statistical
constants. We have shown a method of derivation
of the standard errors in certain statistical constants (the
mean, the qth moment about a fixed point), and in particular
the derivation of the probable error of the mean. Our
main purpose has been to indicate briefly the nature of
the assumptions involved in the derivation of the most
common probable-error formulas. The next step would
very naturally consist in finding the correlations of errors
in two moments. Following this, we could deal with the
general problem of standard errors in parameters of frequency
functions of one variable on the assumption that
the parameters may be expressed in terms of moment
coefficients. Thus, let

y = f(x, c_1, c_2, ⋯)

be any frequency curve, where any parameter

c_i = φ(x̄, μ_2, μ_3, …, μ_q, …)

is a function of the mean and of moments about the mean.

Suppose that this relation is such that we may express
δc_i in terms of δx̄, δμ_2, δμ_3, …, at least approximately,
by differentiation of the function φ. If we then square
δc_i, sum, and divide by the number of samples, we obtain
an approximation to the square of the standard error in c_i.
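The procedure just described is, in modern terms, propagation of error by differentiation. A sketch for the simplest possible case, a hypothetical parameter c = φ(x̄) = x̄^2, for which differentiation gives δc ≈ 2x̄ δx̄ and hence σ_c ≈ 2|mean| σ/s^{1/2} (the population and the sample sizes below are arbitrary choices for illustration):

```python
import random, math

rng = random.Random(23)
s, n_samples = 200, 4000
true_mean, sigma = 3.0, 1.0
cs = []
for _ in range(n_samples):
    x = [rng.gauss(true_mean, sigma) for _ in range(s)]
    xbar = sum(x) / s
    cs.append(xbar ** 2)            # the hypothetical parameter c = xbar^2
mu_c = sum(cs) / n_samples
sd_c = math.sqrt(sum((c - mu_c) ** 2 for c in cs) / n_samples)
# delta c ~ 2 * xbar * delta xbar, so sigma_c ~ 2 * |mean| * sigma / sqrt(s)
predicted = 2 * true_mean * sigma / math.sqrt(s)
```

The observed dispersion of c agrees with the differentiated approximation, though for parameters involving several moments the algebra grows heavy, as the text remarks.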

While, in a general way, this method may be described 
as a straightforward procedure, the derivation of useful 
formulas is likely to involve rather laborious algebraic de- 
tails. Moreover, considerable difficulty may arise in esti- 
mating the errors involved in the approximate results. 

The difficulties of estimating the magnitude of the 


errors involved are likely to be much increased when the
statistical constant, for example, a correlation coefficient,
is a function not merely of the moments of the separate
variables, but also of the product moments of two variables.
In concluding these remarks on standard errors of sta- 
tistical parameters obtained from moments of observa- 
tions, it may be of interest to point out that the
characterization of the sampling fluctuations in such parameters
may be extended and refined by the use of higher-order 
moments of the errors in the parameters. B. H. Camp has 
shown that the use of moments of order higher than two 
may very naturally be accompanied by the use of a cer- 
tain number of terms of Gram-Charlier series as a dis- 
tribution function.

46. Standard error of the median. Thus far in our 
discussion of standard errors and probable errors, we have 
assumed that the statistical constants or characteristics 
of the frequency function are given as functions of the 
moments. There are, however, useful characteristics such 
as a median, a quartile, a decile, and a percentile of a 
distribution which are not ordinarily given as functions
of moments. Such a characteristic number used in the 
description of a distribution is ordinarily calculated from 
its definition, which specifies that its value is such that a 
certain fractional part of the total frequency is on either 
side of the value in question. For example, a median m 
of a given distribution is ordinarily calculated from the 
definition that variates above and below m are to be 
equally frequent. Similarly, a fourth decile D_4 is calculated
from the definition that four-tenths of the frequency
is to be below D_4. We are thus concerned with the sampling
fluctuations of the bounds of the interval which includes
an assigned proportion of the frequency.

To illustrate further, let us consider the standard error
in the median m of samples of N of a variable x distributed
in accord with a continuous law of frequency given by
y = f(x). We assume that there exists a certain ideal median
value M of the population of which we have a sample
of N and that by definition of the median 1/2 is then the
probability that a variate taken at random falls above (or
below) M. We may then write that in any sample of
N variates taken at random from the indefinitely large
set, the number above M is N/2 + d. That is, the median
m of the sample is at a distance δx = δm from M. When
y has a value corresponding to a value of x in the interval
δm, we may write

y δm = d

to within infinitesimals of higher order.

Such an equation connects the change δm in the median
of the sample from the theoretical M with the sampling
deviation d of the frequency above M. Then

δm = d/y and σ_m = σ_d/y .

But, from (1), page 119,

σ_d^2 = Npq = N/4 . Hence σ_m = N^{1/2}/(2y) .

If we have a normal distribution

y = (N/(σ(2π)^{1/2})) e^{−x^2/2σ^2} ,

the value of y at the median is given by

y_M = N/(σ(2π)^{1/2}) = .39894 N/σ ,

and the standard error in the median found from ranks is

σ_m = N^{1/2}/(2 y_M) = 1.2533 σ/N^{1/2} .

Although the theoretical values of the median and of 
the arithmetic mean are equal in a normal distribution, 
the median found from a sample by ranking has a sampling
error 1.2533 times as large as that of the arithmetic
mean obtained as a first moment from the same sample.
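The factor 1.2533 = (π/2)^{1/2} is easily exhibited by simulation. A sketch (illustrative only; the sample size N = 101 and the standard normal population are arbitrary choices):

```python
import random, math

rng = random.Random(2)
N, n_samples = 101, 3000
medians = []
for _ in range(n_samples):
    x = sorted(rng.gauss(0.0, 1.0) for _ in range(N))
    medians.append(x[N // 2])          # middle one of the 101 ranked values
mu = sum(medians) / n_samples
sd_med = math.sqrt(sum((m - mu) ** 2 for m in medians) / n_samples)
predicted = 1.2533 / math.sqrt(N)      # 1.2533 * sigma / N^(1/2), sigma = 1
```

The dispersion of the ranked medians comes out near 0.125, about 25 per cent larger than σ/√N for the mean of the same samples.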

47. Standard deviation of the sum of independent
variables. In sampling problems, it is often found useful
to know the expected value of the square of the standard
deviation of the sum V = X_1 + X_2 + ⋯ + X_s of s mutually
independent variables when we have given the
standard deviations σ_1, σ_2, …, σ_s of each variable in
the population to which it belongs.

Assuming that the given deviations are measured from
the theoretical or expected values for the populations, we
consider deviations x_i = X_i − E(X_i), and write the deviation
of the sum

v = x_1 + x_2 + ⋯ + x_s .

Square both sides, sum for the number of samples
N, and divide by N. Then we have

(1/N) Σ v^2 = (1/N) Σ (x_1^2 + ⋯ + x_s^2) + (2/N) Σ x_i x_j , (i < j) .

If we pass to expected values, and let σ_1^2, σ_2^2, …, σ_s^2
denote the squares of standard deviations of the several
variables and σ_v^2 that of their sum in the populations,
we have

(16) σ_v^2 = σ_1^2 + σ_2^2 + ⋯ + σ_s^2 ,

the product terms vanishing by V, page 117.

It is a matter of some interest to note how the expected
value just found differs from the expected value
of the sum of squares of the s deviations of x_1, x_2, …, x_s
from their mean

(1/s) Σ x_i

obtained from a sample. If we let

(17) x_i' = x_i − (1/s) Σ x_i ,

we are to find E(x_1'^2 + x_2'^2 + ⋯ + x_s'^2) in terms of
E(x_i^2) = σ_i^2 (i = 1, 2, …, s). From (17) we may write

x_1' = ((s − 1)/s) x_1 − x_2/s − ⋯ − x_s/s ,
. . . . . . . . . . . . . . . . . . . .
x_s' = −x_1/s − x_2/s − ⋯ + ((s − 1)/s) x_s .

Then for i ≠ j we have

x_1'^2 + ⋯ + x_s'^2 = ((s − 1)/s)(x_1^2 + ⋯ + x_s^2) − (1/s) Σ x_i x_j .

Hence, passing to expected values, using V, page 117,

(18) E(x_1'^2 + ⋯ + x_s'^2) = ((s − 1)/s)(σ_1^2 + ⋯ + σ_s^2) .
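Both (16) and (18) lend themselves to a direct check. A sketch (the standard deviations σ_i below are arbitrary choices; normal populations are assumed for convenience, though the propositions require only independence):

```python
import random

rng = random.Random(13)
sigmas = [0.5, 1.0, 1.5, 0.5, 1.0, 1.5, 0.5, 1.0, 1.5, 2.0]  # s = 10 variables
s = len(sigmas)
true_total = sum(sg * sg for sg in sigmas)        # sigma_1^2 + ... + sigma_s^2
n_samples = 20000
sum_v2 = sum_dev2 = 0.0
for _ in range(n_samples):
    x = [rng.gauss(0.0, sg) for sg in sigmas]     # deviations from expected values
    v = sum(x)
    sum_v2 += v * v                               # for checking (16)
    xbar = v / s
    sum_dev2 += sum((xi - xbar) ** 2 for xi in x) # for checking (18)
var_of_sum = sum_v2 / n_samples                   # near true_total, as in (16)
mean_dev2 = sum_dev2 / n_samples                  # near ((s-1)/s)*true_total, (18)
```

The deficiency factor (s − 1)/s in (18) is the same one met in passing from formula (2) to formula (3) above.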

48. Remarks on recent progress with sampling errors 
of certain averages obtained from small samples. In the 
development of the theory of sampling, the assumption 
has usually been made that the sample contains a large 
number of individuals, thus leading to the expectation 
that the replacement of probabilities by corresponding 
relative frequencies will give a valuable approximation. 
But the lower bound of large numbers has remained poorly
defined in this connection. For example, certain probable-error
formulas have been applied to as few as ten
observations.

Beginning with a paper by Student in 1908 there
have been important experimental and theoretical results 
obtained on the distribution of arithmetic means, stand- 
ard deviations, and correlation coefficients obtained from
small samples. 

In 1915, Karl Pearson took an important step in advance
by obtaining the curve

(19) y = y_0 x^{n−2} e^{−n x^2/2σ^2}

for the distribution of the standard deviations of samples
of n variates from an infinite population distributed in
accord with the normal curve.

By finding the moments μ_2, μ_3, and μ_4 of this theoretical
distribution, then tabulating the corresponding values
of β_1 and β_2
and the skewness of the curve (19) for integral values of
n from 4 to 100, and making use of the fact that β_1 = 0,
β_2 = 3, and sk (skewness) = 0 are necessary conditions for
a normal distribution, Pearson shows experimentally that
the distribution of standard deviations given by (19) approaches
practically a normal distribution as n increases.
In this experiment, the necessary conditions β_1 = 0, β_2 = 3,
and sk = 0 are assumed to be sufficient for practical approach
to a normal distribution.

From this table of values, Pearson concludes that for 
samples of 50 the usual theory of probable error of the 
standard deviation holds satisfactorily, and that to apply 
it to samples of 25 would not lead to any error of impor- 
tance in the majority of statistical problems. On the other 
hand, if a small sample, n<20 say, of a population be 
taken, the value of the standard deviation found from 
the sample tends to be less than the standard deviation 
of the population. 
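Pearson's caution about small samples is easy to reproduce. A sketch (illustrative; n = 10 and the standard normal population are arbitrary choices) showing that the standard deviation computed from a small sample tends to fall below that of the population:

```python
import random, math

rng = random.Random(17)
n, n_samples = 10, 20000        # small samples, n < 20
total = 0.0
for _ in range(n_samples):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(x) / n
    total += math.sqrt(sum((v - m) ** 2 for v in x) / n)  # sample s.d.
mean_sd = total / n_samples     # tends to fall below the population value 1.0
```

For n = 10 the average sample standard deviation runs some 8 per cent below the population value, in accord with the statement above.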

In a paper published in 1915, R. A. Fisher dealt with
the frequency distribution of the correlation coefficient r
derived from samples of n pairs each taken at random
from an infinite population distributed in accord with the
normal correlation surface (p. 104), where ρ is the correlation
coefficient. The frequency function y = f_n(r)
given by Fisher for the distribution of r was such that
the investigation of its approach to a normal curve as n
increases seemed to require special methods for computing
the ordinates and moments. Such special methods
were given in a joint memoir by H. E. Soper, A. W.
Young, B. M. Cave, A. Lee, and Karl Pearson. The values
of β_1 and β_2 were computed for these distributions to
study the approach to the normal curve.


With respect to the approach of these distributions
to the normal form with increasing values of n, it is found
that the necessary conditions β_1 = 0, β_2 = 3 for a normal
distribution are not well fulfilled for samples of 25 or
even 50, whatever the value of ρ. For samples of 100, the
approach to the conditions β_1 = 0, β_2 = 3 is fair for low
values of ρ, but for large values of ρ, say ρ > .5, there is
considerable deviation of β_1 from 0, and of β_2 from 3. For
samples of 400, on the whole, the approach to the necessary
conditions β_1 = 0, β_2 = 3 is close, but there is quite a
sensible deviation from normality when ρ ≥ .8. These results
give us a striking warning of the dangers in interpreting
the ordinary formula for the probable error of r
when we have small samples.

As to the limitations on the generality of these results, 
it should be remembered that the assumption is made, in 
this theory of the distribution of r from small samples, 
that we have drawn samples from an infinite population
well described by a normal correlation surface, so that the 
conclusions are not in the strictest sense applicable to 
distributions not normally distributed. While the results 
just now described have thrown much light on the dis- 
tributions of statistical constants calculated from small 
samples, it is fairly obvious that much remains to be done 
on this important problem. 

49. The recent generalizations of the Bienaymé-Tchebycheff criterion. Although the use of probable errors for judging of the general order of magnitude of the numerical values of sampling deviations is a great aid to common-sense judgment, it must surely be granted that we are much hampered in drawing certain inferences depending on probable errors because of the limitation that the interpretation of the probable error of a statistical constant is to some extent dependent in any particular case on the normality of the distribution of such constants obtained from samples, and because of the lack of knowledge as to the nature of the distribution.

Any theory that would deal effectively with the prob- 
lem of finding a criterion for judging of the magnitude of 
sampling errors with little or no limitation on the nature 
of the distribution would be a most welcome contribution, 
especially if the theory could be made of value in dealing 
with actual statistical data. The Bienaymé-Tchebycheff criterion (p. 29) may be regarded as an important step in the direction of developing such a theory. We have in the Tchebycheff inequality a theorem specifying an upper bound 1/λ² for the probability that a datum taken at random will deviate from the mean by as much as λ times the standard deviation, without limitation on the nature of the distribution. That is, if P(λσ) is the probability that a datum drawn at random from the entire distribution will differ in absolute value from the mean of all values by as much as λσ, then

(20) P(λσ) ≤ 1/λ² .

To establish a first generalization of this inequality (cf. p. 29), let us consider a variable x which takes mutually exclusive values x₁, x₂, …, x_n with corresponding probabilities p₁, p₂, …, p_n, where p₁ + p₂ + ⋯ + p_n = 1.

Let a be any number from which we wish to measure deviations. For the expected value of the moment of order 2s about a, we may write

μ′_{2s} = p₁d₁^{2s} + p₂d₂^{2s} + ⋯ + p_n d_n^{2s} ,

where d_i = x_i − a.


Let d′, d″, …, be those deviations x_i − a which are numerically as large as an assigned multiple λσ (λ > 1) of the root-mean-square deviation, and let p′, p″, …, be the corresponding probabilities. Then we have

μ′_{2s} ≥ p′d′^{2s} + p″d″^{2s} + ⋯ .

Since d′, d″, …, are each numerically as large as λσ, we have

μ′_{2s} ≥ λ^{2s}σ^{2s}(p′ + p″ + ⋯) .

If we let P(λσ) be the probability that a value of x taken at random will differ from a numerically by as much as λσ, then P(λσ) = p′ + p″ + ⋯, and

P(λσ) ≤ μ′_{2s} / (λ^{2s}σ^{2s}) ,

and the probability of obtaining a deviation numerically less than λσ is greater than

1 − μ′_{2s} / (λ^{2s}σ^{2s}) .

This generalization of the Tchebycheff inequality is due to Karl Pearson,^* except that he assumed a distribution given by a continuous function with a taken as the mean, the x-coordinate of the centroid of the frequency area. For this case, we should merely drop the prime from μ′_{2s}, and write

(21) P(λσ) ≤ μ_{2s} / (λ^{2s}σ^{2s}) .

With s = 1, we obviously have the Tchebycheff inequality as a special case.
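As a numerical illustration of these inequalities, the following sketch in Python (with a hypothetical discrete distribution chosen only for the example) compares the exact probability of a deviation of at least λσ from the mean with the bound of (21):

```python
# A numerical check of inequality (21) on a small discrete distribution
# (the values xs and probabilities ps below are hypothetical).
xs = [-3.0, -1.0, 0.0, 1.0, 2.0, 4.0]
ps = [0.05, 0.20, 0.30, 0.25, 0.15, 0.05]

mean = sum(p * x for p, x in zip(ps, xs))
sigma = sum(p * (x - mean) ** 2 for p, x in zip(ps, xs)) ** 0.5

def bound_and_truth(lam, s):
    """Return the exact probability P(lam*sigma) of a deviation from the
    mean of at least lam*sigma, and Pearson's bound mu_2s/(lam*sigma)^2s."""
    mu2s = sum(p * (x - mean) ** (2 * s) for p, x in zip(ps, xs))
    bound = mu2s / (lam * sigma) ** (2 * s)
    truth = sum(p for p, x in zip(ps, xs) if abs(x - mean) >= lam * sigma)
    return truth, bound

for lam in (1.5, 2.0, 3.0):
    for s in (1, 2):
        truth, bound = bound_and_truth(lam, s)
        assert truth <= bound   # (21); the case s = 1 is the bound of (20)
```

The case s = 1 reproduces the Tchebycheff bound (20); larger values of s often tighten the bound for large λ.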

It is Pearson's view that, although his inequality is in 
most cases a closer inequality than that of Tchebycheff, 
it is usually not close enough to an equality to be of 
practical assistance in drawing conclusions from statisti- 
cal data. On the whole, Pearson expresses not only dis- 
appointment at the results of the Tchebycheff inequality, 
but holds that his own generalization still lacks, in gen- 
eral, the degree of approximation which would make the 
result of real value in important statistical applications. 
Hence, it is an important problem to obtain closer in- 
equalities. The problem of closer inequalities has been 
dealt with in recent papers by several mathematicians.®^ 
Camp, Guldberg, Meidel, and Narumi have succeeded 
particularly well by placing certain mild restrictions on 
the nature of the distribution function F{x). The restric- 
tions are of such a nature as to leave the distribution func- 
tion sufficiently general to be useful in the actual prob- 
lems of statistics. The main restriction placed on F{x) 
by Camp is that it is to be a monotonic decreasing func- 
tion of I re I when |:r|^C(7, c^O. The general ejffect of this 
restriction is to exclude distributions which are not rep- 
resented by decreasing functions of | :r | at points more 
than a certain assigned distance from the origin. We shall 
now present the main results of Camp without proof. 

With the origin so chosen that zero is at the mean, he reaches a generalized inequality, (22), whose right-hand member depends on c and s as well as on λ. When c = 0, the formula (22) is Pearson's formula (21) divided by (1 + 1/(2s))^{2s}; that is,

(22) P(λσ) ≤ μ_{2s} / [λ^{2s}σ^{2s}(1 + 1/(2s))^{2s}] (c = 0) .

The general effect of the work of Camp and Meidel has been to decrease the upper bound of the Pearson inequality (21) by roughly 50 per cent. These generalizations seem to have both theoretical and practical value
when we have regard for the fact that the results apply 
to almost any type of distribution that occurs in practical 
applications. Indeed, it is so satisfying to have only very 
mild restrictions on the nature of the distribution in judg- 
ing sampling errors that further progress in extending the 
cautious limits of sampling fluctuations given by the 
generalizations of the Tchebycheff inequality would be 
of fundamental value. 
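The size of the improvement is easy to see numerically in the simplest case s = 1, c = 0, where the divisor (1 + 1/(2s))^{2s} = 9/4 turns the Tchebycheff bound 1/λ² into 4/(9λ²). A minimal sketch:

```python
# Tchebycheff/Pearson bound versus the Camp-Meidel refinement at c = 0,
# where (22) is (21) divided by (1 + 1/(2s))^(2s).
def pearson_bound(lam, s, beta2s):
    # beta2s stands for mu_2s / sigma^2s; for s = 1 it equals 1.
    return beta2s / lam ** (2 * s)

def camp_bound(lam, s, beta2s):
    return pearson_bound(lam, s, beta2s) / (1 + 1 / (2 * s)) ** (2 * s)

lam = 3.0
p_cheb = pearson_bound(lam, 1, 1.0)   # 1/9
p_camp = camp_bound(lam, 1, 1.0)      # 4/81, a reduction of about 55 per cent
```

The roughly 50 per cent reduction mentioned above is exactly 1 − 4/9 ≈ 55.6 per cent in this case.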

50. Remarks on the sampling fluctuations of an observed frequency distribution from the underlying theoretical distribution. If we have fitted a theoretical fre-
quency curve to an observed distribution, or if we know 
the theoretical frequencies from a priori considerations, 
the question often arises as to the closeness of fit of theory 
and observation. In considering this question, a criterion 
is needed to assist common-sense judgment in testing 
whether the theoretical curve or distribution fits the ob- 
served distribution well or not. It is beyond the scope of 
the present monograph to deal with the theory underly- 
ing such a criterion, but it seems desirable to remark that 
the fundamental paper on this important problem of 


random sampling was contributed by Karl Pearson under 
the title, "On the criterion that a given system of devia- 
tions from the probable in the case of a correlated system 
of variables is such that it can be reasonably supposed 
to have arisen from random sampling," Philosophical 
Magazine, Volume 50, Series 5 (1900), pages 157-75. 

Closely related to the problem of the closeness of fit 
of theory and observation is the fundamental problem of 
establishing a criterion for measuring the probability that 
two independent distributions of frequency are really 
random samples of the same population. Pearson pub- 
lished one solution of this problem in Biometrika, 
Volume 8 (1911-12), pages 250-54. The resulting crite- 
rion represents an important achievement of mathemati- 
cal statistics as an aid to common-sense judgment in con- 
sidering the circumstances surrounding the origin of a 
random sample of data. 


51. Introduction. We have throughout Chapter II as- 
sumed a constant probability underlying the frequency 
ratios obtained from observation. It is fairly obvious that 
frequency ratios are often found from material in which 
the underlying probability is not constant. Then the sta- 
tistician should make use of all available knowledge of the 
material for appropriate classification into subsets for 
analysis and comparison. It thus becomes important to 
consider a set of observations which may be broken into 
subsets for examination and comparison as to whether 
the underlying probability seems to be constant from sub- 
set to subset. In the separation of a large number of rela- 
tive frequencies into n subsets according to some appro- 
priate principle of classification, it is useful to make the 
classification so that the theory of Lexis may be applied. 
In the theory of Lexis we consider three types of series 
or distributions characterized by the following properties: 

1. The underlying probability p may remain a con- 
stant throughout the whole field of observation. Such a 
series is called a Bernoulli series, and has been considered in Chapter II.

2. Suppose next that the probability of an event 
varies from trial to trial within a set of s trials, but that 
the several probabilities for one set of s trials are identical 
to those of every other of n sets of s trials. Then the 
series is called a Poisson series. 

3. When the probability of an event is constant from 



trial to trial within a set but varies from set to set, the 
series is called a Lexis series. 

The theory of Lexis^^ uses these three types as norms 
for comparison of the dispersions of series which arise in 
practical problems of statistics. An estimate of the importance of this theory may probably be formed from the facts that Charlier^ states in his Vorlesungen über mathematische Statistik (1920) that it is the first essential step forward in mathematical statistics since the days of Laplace, and that J. M. Keynes®^ expressed a somewhat similar opinion in his Treatise on Probability (1921).
These may be somewhat extreme views when we recall 
the contributions of Poisson, Gauss, Bravais, and Tchebycheff, but they at least throw light on the outstanding
character of the contribution of Lexis to the theory of 
dispersion. The characteristic feature of the method of 
Lexis is that it encourages the analysis of the material 
by breaking up the whole series into a set of sub-series for 
examination of the fluctuation of the frequency among 
the various sub-series. Such a plan of analysis surely has 
the sanction of common-sense judgment. 

In drawing s balls one at a time with replacements from an urn of such constitution that p is the constant probability that a ball to be drawn will be white, we have already established the following results for Bernoulli series:

1. The mathematical expectation of the number of white balls is sp (p. 26).

2. The standard deviation of the theoretical distribution of frequencies is (spq)^{1/2} (p. 27).

3. The standard deviation of the corresponding distribution of relative frequencies is (pq/s)^{1/2} (p. 27).


52. Poisson series. To develop the theory of the Poisson series let s urns,

U₁, U₂, …, U_s ,

contain white and black balls in such relative numbers that

p₁, p₂, …, p_s

are the probabilities corresponding to the respective urns that a ball to be drawn will be white. Let

(1) p = (p₁ + p₂ + ⋯ + p_s)/s .

From (1) it follows that the mathematical expectation sp of white balls in a set of s obtained one from each urn is exactly equal to the mathematical expectation of white balls in drawing s balls with a constant probability p of success. The standard deviation σ_P of the theoretical distribution of the number of white balls per set of s is related to the standard deviation σ_B = (spq)^{1/2} of a hypothetical Bernoulli distribution with a constant probability p of success, by the equation

(2) σ_P² = spq − Σ_{t=1}^{s} (p_t − p)² = σ_B² − Σ_{t=1}^{s} (p_t − p)² ,

where p is equal to the mean value of p₁, p₂, …, p_s. To prove this we start with (1) and recall that sp is the arithmetic mean of the number of white balls in any set of s under the theoretical distribution.


Let us consider next the standard deviation σ_P of white balls in the theoretical series of s balls. The square of the standard deviation of the frequency of white balls in drawing a single ball with the chance p_t that it will be white is given by σ_t² = p_t q_t; that is, by making s = 1 in spq.

When the probabilities p₁, p₂, …, p_s are independent of one another, it follows from (16), page 137, that

σ_P² = σ₁² + σ₂² + ⋯ + σ_s² ,

where σ₁, σ₂, …, σ_s are the standard deviations of white balls in drawing one ball from each urn corresponding to probabilities p₁, p₂, …, p_s, respectively, and σ_P is the standard deviation of white balls among the s balls together drawn one from each urn. Hence, we have

(3) σ_P² = p₁q₁ + p₂q₂ + ⋯ + p_sq_s = Σ_{t=1}^{s} p_t q_t .

Now

p_t = p + (p_t − p) , q_t = q − (p_t − p) ,

so that

p_t q_t = pq − (p_t − p)(p − q) − (p_t − p)² ,

and

(4) Σ_{t=1}^{s} p_t q_t = spq − Σ_{t=1}^{s} (p_t − p)² , since Σ_{t=1}^{s} (p_t − p) = 0 .

Hence, we have established (2), from which it follows at once that the standard deviation of a Poisson series is less


than that of the corresponding Bernoulli series with constant probability of success equal to the arithmetic mean of the variable probabilities of success.

To give an illustration of a Poisson series, conceive of 
n populated districts. Each district is to consist of s sub- 
divisions for which the probability of death at a given age 
varies from one subdivision to another, but in which the 
series of s probabilities are identical from district to dis- 
trict. To illustrate further this type of distribution, con- 
struct an urn schema consisting of 10 urns each of which 
contains 15 balls, and in which the number of white balls 
in the respective urns is 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. The 
arithmetic mean of the probabilities of drawing a white 
ball is 1/2. A set of 10 is obtained by drawing one ball 
from each urn. Then each ball is returned to the urn 
from which it was drawn, and a second set of 10 is drawn. 
This process is continued until we have 1,000 sets of 10. 
The resulting frequency distribution of the number of white balls is a Poisson distribution.
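The urn schema just described is easy to simulate; the sketch below (stdlib Python, with an arbitrarily chosen seed, and more than 1,000 sets so that the estimate is stable) compares the observed variance of the number of white balls per set with equation (2):

```python
import random

random.seed(7)   # arbitrary seed, for reproducibility of this sketch

# The urn schema of the text: 10 urns of 15 balls with 3, 4, ..., 12 white.
p_t = [w / 15 for w in range(3, 13)]    # chance of white, urn by urn
s = len(p_t)                            # one ball per urn: sets of s = 10
p_bar = sum(p_t) / s                    # arithmetic mean = 1/2

# Theoretical variance of the Poisson series, equation (2):
var_theory = s * p_bar * (1 - p_bar) - sum((p - p_bar) ** 2 for p in p_t)

n_sets = 100_000
counts = [sum(1 for p in p_t if random.random() < p) for _ in range(n_sets)]

m = sum(counts) / n_sets                        # near s * p_bar = 5
v = sum((c - m) ** 2 for c in counts) / n_sets  # near var_theory, below spq
```

Here var_theory = 2.5 − 0.3667 ≈ 2.13, visibly below the Bernoulli value spq = 2.5, as the theorem requires.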

53. Lexis series. To give a statistical illustration of a 
Lexis series, conceive of n populated districts in each of 
which the probability of death is constant for men of 
given age, but is variable from district to district. 

To develop the theory of the Lexis distribution we draw s balls one at a time from an urn U₁ with a constant probability p₁ of getting a white ball, from U₂ with a constant probability p₂, …, from U_n with a constant probability p_n.

The mathematical expectation of white balls in thus drawing ns balls is sp₁ + sp₂ + ⋯ + sp_n = nsp, where p = (1/n)(p₁ + p₂ + ⋯ + p_n) is the arithmetic mean of the probabilities p₁, p₂, …, p_n.


Since nsp is the mathematical expectation of white balls in samples of ns balls, the mathematical expectation in samples of s balls taken one at a time from a random urn is sp. This value sp is identical to the mathematical expectation of white balls in samples of s balls of a Bernoulli series with a constant probability p.

Since p_t is the probability that a ball to be drawn from urn U_t will be white, the expected value of the square of the standard deviation of the number of white balls in samples of s drawn from U_t is sp_tq_t. In other words, sp_tq_t is the mean square of the deviations of white balls from sp_t in samples of s drawn from U_t. If the deviations were measured from sp instead of sp_t, it follows from the theorem (p. 21) for changing the origin or axis of second moments that the mean square of the deviations would be

(5) sp_tq_t + (sp_t − sp)² .

Suppose this mean value of the squares of deviations were obtained from N samples of s each. Then

(6) Nsp_tq_t + Ns²(p_t − p)²

would be the expected value of the sum of squares of the deviations from sp in the N samples of s drawn from U_t. By adding together the expression (6) for t = 1, 2, …, n, we have

(7) Ns Σ_{t=1}^{n} p_tq_t + Ns² Σ_{t=1}^{n} (p_t − p)²

for the expected value of the sum of squares of the deviations from sp for the n urns. In obtaining (7), we have drawn in all Nn sets of s balls of which N sets are from each urn.

The mean-square deviation from sp of the number of white balls in samples of s thus taken from the n urns U₁, U₂, …, U_n is then obtained by dividing (7) by the number of sets Nn. This gives

σ_L² = (s/n) Σ_{t=1}^{n} p_tq_t + (s²/n) Σ_{t=1}^{n} (p_t − p)² .

From (4) above,

Σ_{t=1}^{n} p_tq_t = npq − Σ_{t=1}^{n} (p_t − p)² ,

and hence

(8) σ_L² = spq + [s(s − 1)/n] Σ_{t=1}^{n} (p_t − p)² .

It should be observed from (8) that the standard deviation of a Lexis distribution is greater than that of a Bernoulli distribution based on a constant probability p which is equal to the mean value of the given probabilities p₁, p₂, …, p_n.
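A simulation makes the supernormal dispersion of a Lexis series concrete; the sketch below (the five hypothetical p_t values and the seed are chosen only for illustration) compares the pooled variance of the counts with equation (8):

```python
import random

random.seed(11)  # arbitrary seed for this sketch

# A Lexis series: the probability is constant within an urn but varies
# from urn to urn (hypothetical values).
p_t = [0.3, 0.4, 0.5, 0.6, 0.7]
n = len(p_t)
s = 20                                   # balls per set
p_bar = sum(p_t) / n                     # = 0.5
sum_sq = sum((p - p_bar) ** 2 for p in p_t)

# Equation (8): sigma_L^2 = spq + s(s - 1)/n * sum (p_t - p)^2
var_lexis = s * p_bar * (1 - p_bar) + s * (s - 1) / n * sum_sq   # 5 + 7.6

N = 20_000                               # sets drawn from each urn
counts = [sum(1 for _ in range(s) if random.random() < p)
          for p in p_t for _ in range(N)]

m = sum(counts) / len(counts)                        # near s * p_bar = 10
v = sum((c - m) ** 2 for c in counts) / len(counts)  # near 12.6 > spq = 5
```

The observed variance settles near 12.6, well above the Bernoulli value spq = 5, illustrating the inequality noted above.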

54. The Lexis ratio. Let σ′ be the standard deviation of a series of relative frequencies obtained by experiment from statistical data. On the hypothesis of a Bernoulli distribution the theoretical value of the standard deviation is σ_B = (pq/s)^{1/2}, where p is the probability of success in any single trial. The ratio

L = σ′/σ_B

is called the Lexis ratio; the same value is obtained from the corresponding standard deviations sσ′ and sσ_B of the absolute frequencies. When L = 1, the series of relative frequencies is said to have normal dispersion. When L < 1, the series is said to have subnormal dispersion. When L > 1, the series is said to have supernormal dispersion. Illustrative applications of the Lexis ratio to statistical data are readily available.®^
From the nature of the Lexis theory it is fairly obvi- 
ous, as implied in the introduction to this chapter, that 




TABLE I

Deaths per 1,000 of White Infants under One Year of Age in Certain Registration States (among them California, Minnesota, and North Carolina), with the Arithmetic Mean of the Rates

the application of the theory to particular statistical data 
involves breaking up the aggregate into a number of sub- 
sets according to some appropriate scheme of classifica- 
tion which would ordinarily depend on much knowledge 
of the material which is the subject of the investigation. 
Then we are concerned not only with a frequency ratio 
for the entire aggregate, but also with the stability of 
frequency ratios among the subsets. The dispersion of 
frequency ratios is calculated and compared with the ex- 
pected value in the case of a Bernoulli distribution. 

As an example, let us consider the dispersion of death- 
rates of white infants under one year of age in registration 


states^ of the United States in which the number of 
births per year of white children is between 33,000 and 
67,000 (see Table I). This restriction is placed on the 
selection of states so that the number of instances per set 
has only a moderate amount of variability.

In most of the practical problems of statistics the exact values of the underlying probabilities are unknown, and the best substitutes available are the approximate values of the probabilities given by available relative frequencies. Substituting these frequency ratios as approximations for p and q, we find the Bernoulli standard deviation from the formula σ_B = (pq/s)^{1/2}. We then compare σ_B with the standard deviation obtained directly from the data. The simple arithmetic mean of the death-rates is 65.7 per 1,000, and their standard deviation (without weighting) is 5.21 per 1,000. If these infantile death-rates constituted a Bernoulli distribution with a number of instances equal to the average number of births, 55,257 in each case, we should have

σ_B = (pq/s)^{1/2} = [(.0657)(.9343)/55,257]^{1/2} = .00105 per person = 1.05 per 1,000 .

Hence, the Lexis ratio is

L = 5.21/1.05 = 4.96 .
Hence the dispersion is supernormal, and we have strong support for the inference that there is a significant variation in infant mortality from one of these states to


another. The full interpretation of this fact would re- 
quire much knowledge of the sources of the material. 
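The arithmetic of the Lexis ratio above is compact enough to script; the sketch below recomputes σ_B and L from the summary figures quoted in the text:

```python
# Lexis ratio for the infant-mortality illustration, using the summary
# figures of the text: mean rate .0657, observed SD .00521 (5.21 per
# 1,000), and average number of births s = 55,257.
p = 0.0657
q = 1 - p
s = 55_257
sigma_obs = 0.00521

sigma_B = (p * q / s) ** 0.5     # about .00105, i.e. 1.05 per 1,000
L = sigma_obs / sigma_B          # about 5, far above 1: supernormal
```

A ratio near 5 leaves no doubt that the dispersion is supernormal for these data.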

A reasonable plan for the determination of the maxi- 
mum district over which the infantile death-rates are 
essentially constant seems to involve breaking the aggre- 
gate of instances into subsets in a variety of ways and 
then testing results as above. Some measure of doubt will 
remain, but this procedure encourages the kind of analysis 
that gives strong support to induction. 



55. Introduction. In § 56 we shall attempt to show (cf. 
p. 65) that a certain line of development^^ of the binomial 
distribution suggests the use of the Gram-Charlier Type 
A series as a natural extension of the De Moivre-Laplace 
approximation and the Type B series as a natural exten- 
sion of the Poisson exponential approximation considered 
in Chapter II. Then in §§ 57-58 we shall develop meth-
ods for the determination of the parameters in terms of 
moments of the observed frequency distribution, thus 
deriving certain results stated without proof in § 19 
and §21. 

56. On a development of Type A and Type B from the law of repeated trials. As in the De Moivre-Laplace theory, we consider the probability that in a sample of s individuals, taken at random from an unlimited supply, r individuals will have a certain attribute. That is, the probability we wish to represent is given by

B(r) = [s!/(r!(s − r)!)] p^r q^{s−r} ,

and we shall use a function of the form

(1) B₀(x) = (1/2π) ∫_{−π}^{π} θ(w) e^{−xwi} dw

for interpolation between the values B(r), where i² = −1 and

(2) θ(w) = (pe^{wi} + q)^s = Σ_{r=0}^{s} B(r) e^{rwi} .

In the terminology of Laplace, θ(w) is the generating function of the sequence B(r).

We shall first show that B₀(x) = B(m) when x is a positive integer m. To prove this, substitute θ(w) from (2) in (1) and integrate. This gives

B₀(x) = Σ_{r=0}^{s} B(r) · sin (r − x)π / [(r − x)π] .

When x = m is a positive integer, each term but one of the right member vanishes and this one has the value B(m). Accordingly, B₀(m) = B(m).

Thus formula (1) gives exactly the terms of the expansion of (p + q)^s for positive integral values x = m. It may be considered an interpolation formula for values of x between the integral values.
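The interpolation property is easy to verify directly from the sum of sine-kernel terms just obtained (s and p below are arbitrary illustrative values):

```python
import math

def B(r, s, p):
    """Binomial term B(r) = C(s, r) p^r q^(s-r)."""
    return math.comb(s, r) * p ** r * (1 - p) ** (s - r)

def sinc(t):
    """sin(pi t)/(pi t), with the limiting value 1 at t = 0."""
    return 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)

def B0(x, s, p):
    """B0(x) = sum_r B(r) sin((r - x)pi)/((r - x)pi)."""
    return sum(B(r, s, p) * sinc(r - x) for r in range(s + 1))

s, p = 12, 0.3
# At integral x the formula reproduces the binomial terms exactly:
for m in range(s + 1):
    assert abs(B0(m, s, p) - B(m, s, p)) < 1e-12
# Between integers it interpolates smoothly, e.g. at x = 3.5.
```

Note that math.comb requires Python 3.8 or later.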

We shall be interested in two developments of this 
interpolation formula. The first is based on the develop- 
ment of log 0{w) in powers of w, and the second on the 
development in powers of p. The resulting types of de- 
velopment are known as the Type A and Type B series, 


From the form of θ(w) in (2), we have

(3) d log θ(w)/dw = spie^{wi} / (pe^{wi} + q) .

Developing the right-hand member of (3) in powers of w and integrating, remembering that θ(0) = 1, we obtain

log θ(w) = sp(wi) + (1/2!) spq(wi)² − (1/3!) spq(p − q)(wi)³ + ⋯ ,

or, writing

(4) θ(w) = e^{b₁(wi) + b₂(wi)²/2! + b₃(wi)³/3! + ⋯} ,

we have

(5) b₁ = sp , b₂ = spq , b₃ = −spq(p − q) , ⋯ .

We now write

(6) θ(w) = e^{b₁wi − b₂w²/2} [1 − A₃(wi)³ + A₄(wi)⁴ + ⋯] .

Since it follows from (2) that θ(w) is an entire function of w, the series in brackets in the right member of (6) converges, since it is the quotient of the entire function θ(w) by an exponential factor with no singularities in the finite part of the plane.

From (4), (5), and (6) we have

(7) A₃ = (1/6) spq(p − q) , A₄ = (1/24) spq(1 − 6pq) , ⋯ .

Inserting θ(w) from (6) in (1), we have

(8) B₀(x) = (1/2π) ∫_{−π}^{π} dw e^{−(x−b₁)wi − b₂w²/2} [1 − A₃(wi)³ + A₄(wi)⁴ + ⋯] .

If we write

(9) Ω(x) = (1/2π) ∫_{−π}^{π} dw e^{−(x−b₁)wi − b₂w²/2} ,

we have from (8),

(10) B₀(x) = Ω(x) + A₃ d³Ω/dx³ + A₄ d⁴Ω/dx⁴ + ⋯ .

If, however, b₂ is not small, we may use in place of Ω(x) the function φ(x) defined as

φ(x) = (1/2π) ∫_{−∞}^{∞} dw e^{−(x−b₁)wi − b₂w²/2} ,

obtained by changing the limits of integration from ±π to ±∞. Moreover, we shall prove that

(11) φ(x) = (2πb₂)^{−1/2} e^{−(x−b₁)²/(2b₂)} .
To prove this, we write 

<t>{x) = 2^ e-*««''/2 cos [w(x-bi)]dw 



The second term vanishes because the sine is an odd 
function. Since the cosine is an even function, we may 

(12) 0W = - I ** e-*»«"^^ cos lw{x-bi)]dw . 

Differentiation with regard to x gives 

^^= -- Pe-^'^V2 ^ sin [w{x-bi)]dw . 

ax TT^o 

Integrate the right-hand member by parts and we have 

ax 02V Jq 

__ (x-bi) , . 
= ^ — 4>{,x) . 

Then by integration, 

(13) <f>{x)^Ae■'<"^^>*^^t. 

To find A, let a; = ^i in (12) and (13). This gives the well 
known definite integral 

A=- I '°c-*a«^/2j^^. 1 

""Jo ^ ' •*"'"(27rW^^^ 
Hence, we have 



as given in (11). 



Therefore we may write in place of (10) the Type A series

(14) B₀(x) = φ(x) + A₃ d³φ/dx³ + A₄ d⁴φ/dx⁴ + ⋯ ,

if σ² = b₂.
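A sketch of the first terms of (14) applied to a binomial (s and p below are illustrative choices; the derivatives of φ are expressed through the Hermite polynomials introduced in § 57, here written as explicit cubics and quartics):

```python
import math

def type_A(x, s, p):
    """First terms of the Type A series (14) for the binomial:
    phi + A3*phi''' + A4*phi'''', with b1 = sp and b2 = spq."""
    q = 1 - p
    b1, b2 = s * p, s * p * q
    A3 = s * p * q * (p - q) / 6        # equation (7)
    A4 = s * p * q * (1 - 6 * p * q) / 24
    t = (x - b1) / b2 ** 0.5
    phi = math.exp(-t * t / 2) / math.sqrt(2 * math.pi * b2)
    # phi^(n)(x) = (-1)^n He_n(t) phi(x) / b2^(n/2):
    He3 = t ** 3 - 3 * t
    He4 = t ** 4 - 6 * t ** 2 + 3
    return phi - A3 * He3 * phi / b2 ** 1.5 + A4 * He4 * phi / b2 ** 2

s, p = 50, 0.3
for r in range(5, 26):
    exact = math.comb(s, r) * p ** r * (1 - p) ** (s - r)
    assert abs(type_A(r, s, p) - exact) < 2e-3
```

With only the A₃ and A₄ corrections the agreement with the binomial terms is already close; adding further terms of the series would tighten it.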

To study the degree of approximation secured in changing the limits of integration from ±π to ±∞ in passing from Ω(x) to φ(x), we observe that

φ(x) − Ω(x) = (1/π) ∫_π^{∞} dw e^{−b₂w²/2} cos [w(x − b₁)] ,

and hence

|φ(x) − Ω(x)| < (1/π) ∫_π^{∞} dw e^{−σ²w²/2} = (1/πσ) ∫_{πσ}^{∞} e^{−λ²/2} dλ ,

if λ = σw.

Hence, the difference approaches zero very rapidly with increasing values of σ, as may be seen by using the values of the last integral corresponding to values of λ = 1, 2, 3, 4, …, in a table of this probability integral. A similar examination for the derivatives of Ω and φ will show that their differences similarly approach zero.

To develop (2) in powers of p, we first write

(15) d log θ(w)/dw = spie^{wi} / [1 − p(1 − e^{wi})] ,

which may be developed as a convergent series in powers of p(1 − e^{wi}) since |p(1 − e^{wi})| < 1 for p sufficiently small. Since θ(0) = 1, we obtain by integration

(16) log θ(w) = −sp(1 − e^{wi}) − (sp²/2)(1 − e^{wi})² − (sp³/3)(1 − e^{wi})³ − ⋯ .

Hence, writing

(17) θ(w) = e^{−sp(1−e^{wi})} [1 + B₂(1 − e^{wi})² + B₃(1 − e^{wi})³ + ⋯] ,

we have

B₂ = −(1/2) sp² , B₃ = −(1/3) sp³ , ⋯ .
Now, from (1) and (17),

(18) B₀(x) = (1/2π) ∫_{−π}^{π} dw e^{−xwi − sp(1−e^{wi})} [1 + B₂(1 − e^{wi})² + B₃(1 − e^{wi})³ + ⋯] .

If we write

ψ(x) = (1/2π) ∫_{−π}^{π} dw e^{−xwi − sp(1−e^{wi})} = ∫_{−π}^{π} Q(w, x) dw ,

and then let

Δψ(x) = ψ(x) − ψ(x − 1)
      = (1/2π) ∫_{−π}^{π} dw e^{−xwi − sp(1−e^{wi})} − (1/2π) ∫_{−π}^{π} dw e^{−(x−1)wi − sp(1−e^{wi})}
      = ∫_{−π}^{π} (1 − e^{wi}) Q(w, x) dw ,

we find in the same way

Δ²ψ(x) = ∫_{−π}^{π} (1 − e^{wi})² Q(w, x) dw ,

Δ³ψ(x) = ∫_{−π}^{π} (1 − e^{wi})³ Q(w, x) dw , ⋯ .

Hence, we have

(19) B₀(x) = ψ(x) + B₂ Δ²ψ(x) + B₃ Δ³ψ(x) + ⋯ .
To give other forms to

ψ(x) = (1/2π) ∫_{−π}^{π} e^{−xwi − sp(1−e^{wi})} dw ,

we may write

ψ(x) = (e^{−sp}/2π) ∫_{−π}^{π} e^{−xwi + spe^{wi}} dw

     = (e^{−sp}/2π) ∫_{−π}^{π} e^{−xwi} [1 + sp e^{wi} + (sp)² e^{2wi}/2! + ⋯] dw

(20)  = e^{−sp} [ sin (x)π/(xπ) + sp · sin (x − 1)π/((x − 1)π) + ⋯ + ((sp)^r/r!) · sin (x − r)π/((x − r)π) + ⋯ ]

      = (e^{−sp} sin πx / π) [ 1/x − sp/(x − 1) + (sp)²/(2!(x − 2)) − ⋯ ]

(21)  = (e^{−λ} sin πx / π) Σ_{r=0}^{∞} (−1)^r λ^r / [(x − r) r!] ,

if sp is replaced by λ.

The foregoing analytical processes can be easily justified by the use of the properties of uniformly convergent series. When x approaches an integer r, it is easily seen from (20) that each term approaches zero except the term

e^{−sp} ((sp)^r/r!) · sin (x − r)π/((x − r)π) ,

and this term has as its limit the Poisson exponential

e^{−λ} λ^r / r! .

The formula (21) may therefore be regarded as defining the Poisson exponential e^{−λ} λ^x / x! for non-integral values of x.
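The limiting behavior just described can be checked directly from the term-by-term form of (20), summing the series until the coefficients λ^r/r! are negligible (λ = 4 below is an arbitrary illustrative value):

```python
import math

def psi(x, lam, terms=80):
    """psi(x) = e^(-lam) * sum_r (lam^r/r!) sin((x - r)pi)/((x - r)pi),
    the extension of the Poisson exponential to non-integral x."""
    total = 0.0
    coef = 1.0                    # lam^r / r!, built up term by term
    for r in range(terms):
        d = x - r
        kernel = 1.0 if d == 0 else math.sin(math.pi * d) / (math.pi * d)
        total += coef * kernel
        coef *= lam / (r + 1)
    return math.exp(-lam) * total

lam = 4.0
# At integral x the function reduces to the Poisson exponential:
for m in range(10):
    poisson = math.exp(-lam) * lam ** m / math.factorial(m)
    assert abs(psi(m, lam) - poisson) < 1e-9
```

Evaluating psi at non-integral x then gives the interpolated ordinates used in the Type B series.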

The development in series (19) is useful only when p is so small that sp is not large, say sp ≤ 10, s being a large number. In this case, sp is likely to be too small to allow an expansion in a Type A series. Otherwise, the development in Type A is better suited to represent the terms of the binomial series.

While the above demonstration is limited to the representation of the law of probability given by terms of a binomial, Wicksell has gone much further in the paper cited above in showing a line of development which suggests the use of the Gram-Charlier series for the representation of the law of probability given by terms of the hypergeometric series, thus representing the law of probability which gives the basis of the Pearson system of generalized frequency curves. Unfortunately, the demonstration of this extension would require somewhat more space devoted to formal analysis than seems desirable in the present monograph. Hence we merely state the above fact without a demonstration.

57. The values of the coefficients of the Type A series obtained from the biorthogonal property. If in (14) we measure x from the centroid as an origin and in units equal to the standard deviation, σ, we may write in place of (14)

(22) F(x) = φ(x) + a₃φ^{(3)}(x) + a₄φ^{(4)}(x) + ⋯ ,

where φ(x) = (2π)^{−1/2} e^{−x²/2}, and φ^{(n)}(x) is the nth derivative of φ(x) with respect to x. The coefficients a_n (n = 0, 3, 4, …) in the Type A series may be easily expressed in terms of moments of area under the given frequency curve about the centroidal ordinate because the functions φ^{(n)}(x) and the Hermite polynomials H_n(x) defined by the equation

φ^{(n)}(x) = (−1)^n H_n(x) φ(x)

form a biorthogonal system. Thus,

(23) ∫_{−∞}^{∞} φ^{(n)}(x) H_m(x) dx = 0 (m ≠ n) ,

and

(24) ∫_{−∞}^{∞} φ^{(n)}(x) H_m(x) dx = (−1)^n n! (m = n) ,

and this biorthogonal property affords a simple method of determining the coefficients in the Type A series.

To prove (23) and (24) we may write

∫_{−∞}^{∞} φ^{(n)}(x) H_m(x) dx = (−1)^n ∫_{−∞}^{∞} φ(x) H_n(x) H_m(x) dx = (−1)^{m+n} ∫_{−∞}^{∞} φ^{(m)}(x) H_n(x) dx .

Integration by parts gives

∫_{−∞}^{∞} φ^{(n)}(x) H_m(x) dx = [φ^{(n−1)}(x) H_m(x)]_{−∞}^{∞} − ∫_{−∞}^{∞} φ^{(n−1)}(x) H_m′(x) dx = −∫_{−∞}^{∞} φ^{(n−1)}(x) H_m′(x) dx .

Continuing until we have performed m + 1 successive integrations by parts, we obtain, assuming n > m,

(25) ∫_{−∞}^{∞} φ^{(n)}(x) H_m(x) dx = (−1)^{m+1} ∫_{−∞}^{∞} φ^{(n−m−1)}(x) H_m^{(m+1)}(x) dx ,

where H_m^{(m+1)}(x) is the (m + 1)th derivative of H_m(x). Since H_m(x) is a polynomial of degree m in x, its (m + 1)th derivative vanishes and we have

(26) ∫_{−∞}^{∞} φ^{(n)}(x) H_m(x) dx = 0

for n > m. But from the form of (25), it is obvious that we could equally well prove (26) for m > n. For m = n, we proceed as above with m successive integrations. We then have, if we replace m by n,

∫_{−∞}^{∞} φ^{(n)}(x) H_n(x) dx = (−1)^n ∫_{−∞}^{∞} φ(x) H_n^{(n)}(x) dx .

But the nth derivative H_n^{(n)}(x) of the polynomial H_n(x) is equal to n!. Hence,

∫_{−∞}^{∞} φ^{(n)}(x) H_n(x) dx = (−1)^n n! ∫_{−∞}^{∞} φ(x) dx = (−1)^n n! .

By multiplying both members of (22) by H_n(x) and integrating under the assumption that the series is uniformly convergent, we have

∫_{−∞}^{∞} F(x) H_n(x) dx = a_n ∫_{−∞}^{∞} φ^{(n)}(x) H_n(x) dx = (−1)^n n! a_n ,

since by application of (26) all terms of the right-hand member vanish except the one with the coefficient a_n. Hence

(28) a_n = [(−1)^n / n!] ∫_{−∞}^{∞} F(x) H_n(x) dx .

Moreover, to determine a_n numerically for an observed frequency distribution we replace F(x) in (28) by the observed frequency function f(x).

For purposes of numerical application, let us now change back from the standard deviation as a unit to measuring x in the ordinary unit of measurement (feet, pounds, etc.) involved in the problem, but still keep the origin at the centroid. This means that we replace x in (28) by x/σ. If in these units f(x) gives the observed frequency distribution, we may write in place of (28)

(29) a_n = [(−1)^n / n!] ∫_{−∞}^{∞} f(x) H_n(x/σ) dx .

Since H_n(x/σ) is a polynomial of degree n in x, the coefficients a_n are thus given in terms of moments of area under the observed frequency curve. It is then fairly obvious that the determination of the moments of area under the frequency curve plays an important part in the Gram-Charlier system as well as in the Pearson system.
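The moment formula for the coefficients is easy to carry out for a discrete observed distribution; the sketch below (the relative frequencies are hypothetical) computes a_n by the sum analogue of the integral above:

```python
import math

# Coefficients of the Type A series by the moment formula of § 57,
# a_n = ((-1)^n / n!) * sum f(x) He_n(x/sigma), for a small observed
# distribution (hypothetical relative frequencies, summing to 1).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
fs = [0.05, 0.25, 0.30, 0.25, 0.10, 0.05]

mean = sum(f * x for f, x in zip(fs, xs))
xs = [x - mean for x in xs]                 # centroid as origin
sigma = sum(f * x * x for f, x in zip(fs, xs)) ** 0.5

def He(n, t):
    """Hermite polynomials of § 57, phi^(n)(x) = (-1)^n He_n(x) phi(x),
    by the recurrence He_{k+1}(t) = t He_k(t) - k He_{k-1}(t)."""
    if n == 0:
        return 1.0
    h_prev, h = 1.0, t
    for k in range(1, n):
        h_prev, h = h, t * h - k * h_prev
    return h

def a(n):
    total = sum(f * He(n, x / sigma) for f, x in zip(fs, xs))
    return (-1) ** n / math.factorial(n) * total

# a(0) = 1 and a(1) = a(2) = 0 by the choice of origin and unit, while
# a(3) = -(skewness)/6 and a(4) = (beta_2 - 3)/24 carry the corrections.
```

The vanishing of a(1) and a(2) is a direct consequence of measuring x from the centroid in units of the standard deviation.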
58. The values of the coefficients of Type A series obtained from a least-squares criterion. It may be proved by following J. P. Gram that the value of any coefficient a_n obtained in § 57 by the use of the biorthogonal property is the same as that obtained by finding the best approximation to f(x), in the sense of a certain least-squares criterion, by the first m terms of the series (m ≥ n). To prove this statement, we may proceed as follows: Consider the series

F(x) = a₀φ(x) + a₃φ^{(3)}(x) + ⋯ + a_mφ^{(m)}(x)

for the representation of an observed frequency function f(x). The least-squares criterion^^ that

(30) V = ∫_{−∞}^{∞} [f(x) − F(x)]² dx/φ(x)

shall be a minimum leads to values of coefficients given in § 19.

To prove this, we square the binomial f(x) − F(x) and differentiate partially with regard to the parameters a₀, a₃, …, a_m. This gives

∂V/∂a_n = −2 (∂/∂a_n) ∫_{−∞}^{∞} f(x)F(x) dx/φ(x) + (∂/∂a_n) ∫_{−∞}^{∞} [F(x)]² dx/φ(x)

        = 2(−1)^{n+1} ∫_{−∞}^{∞} f(x) H_n(x) dx + 2a_n ∫_{−∞}^{∞} [H_n(x)]² φ(x) dx ,

since

∫_{−∞}^{∞} [F(x)]² dx/φ(x) = ∫_{−∞}^{∞} [a₀²[H₀(x)]² + a₃²[H₃(x)]² + ⋯ + a_m²[H_m(x)]²] φ(x) dx ,

the product terms vanishing because of (26).

Making ∂V/∂a_n = 0, we have

(31) 2(−1)^{n+1} ∫_{−∞}^{∞} f(x) H_n(x) dx + 2a_n ∫_{−∞}^{∞} [H_n(x)]² φ(x) dx = 0 ,

and

(32) ∫_{−∞}^{∞} [H_n(x)]² φ(x) dx = (−1)^n ∫_{−∞}^{∞} φ^{(n)}(x) H_n(x) dx = n! .

From (31) and (32), we have

a_n = [(−1)^n / n!] ∫_{−∞}^{∞} f(x) H_n(x) dx ,

which is identical with the value obtained by the use of the biorthogonal property.
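Relation (32), on which the equivalence turns, can be checked numerically; the sketch below evaluates the integral of H_n(x)²φ(x) by a midpoint rule over a wide interval and compares it with n!:

```python
import math

def phi(x):
    """Standard normal ordinate (2 pi)^(-1/2) e^(-x^2/2)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def He(n, t):
    """Hermite polynomials with phi^(n) = (-1)^n He_n phi."""
    if n == 0:
        return 1.0
    h_prev, h = 1.0, t
    for k in range(1, n):
        h_prev, h = h, t * h - k * h_prev
    return h

def integral_He2_phi(n, lo=-12.0, hi=12.0, steps=20_000):
    """Midpoint-rule value of the integral of He_n(x)^2 phi(x)."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * dx
        total += He(n, t) ** 2 * phi(t)
    return total * dx

# Check (32): the integral equals n! for n = 0, 1, ..., 5.
for n in range(6):
    exact = math.factorial(n)
    assert abs(integral_He2_phi(n) - exact) < 1e-3 * exact + 1e-6
```

The same quadrature applied to H_n H_m with m ≠ n returns values near zero, confirming (26) as well.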

59. The coefficients of a Type B series. In considering
the determination of the coefficients c_0, c_1, c_2, \ldots,
of the Type B series, we shall restrict our treatment to a
distribution of equally distant ordinates at non-negative
integral values of x, and shall for simplicity consider the
representation by the first three terms of the series. That
is, we write

F(x) = c_0\,\psi(x) + c_1\,\Delta\psi(x) + c_2\,\Delta^2\psi(x) ,

for x = 0, 1, 2, \ldots, where \psi(x) = e^{-\lambda}\lambda^x/x! is the Poisson
exponential function and \Delta\psi(x) = \psi(x) - \psi(x-1). Let f(x) give the
ordinates of the observed distribution of relative frequencies, so that

\sum f(x) = 1 .

Equating sums of ordinates and first and second
moments \mu_1' and \mu_2' of ordinates of the theoretical and
observed distributions, we may now determine the coefficients
approximately from the equations:

(33) \qquad \sum \bigl[ c_0\,\psi(x) + c_1\,\Delta\psi(x) + c_2\,\Delta^2\psi(x) \bigr] = \sum f(x) = 1 ,
\sum x \bigl[ c_0\,\psi(x) + c_1\,\Delta\psi(x) + c_2\,\Delta^2\psi(x) \bigr] = \sum x f(x) = \mu_1' ,
\sum x^2 \bigl[ c_0\,\psi(x) + c_1\,\Delta\psi(x) + c_2\,\Delta^2\psi(x) \bigr] = \sum x^2 f(x) = \mu_2' .

Before solving these equations for c_0, c_1, and c_2, we
may simplify by the substitution of certain values which
are close approximations when we are dealing with large
numbers. Thus, we recall that we have derived in § 14,
Chapter II, the following approximations:

\sum \psi(x) = 1 , \qquad \sum x\,\psi(x) = \lambda , \qquad \sum x^2\,\psi(x) = \lambda + \lambda^2 .

We may next easily obtain the following further approximate
values:

\sum \Delta\psi(x) = 0 ,
\sum x\,\Delta\psi(x) = \sum x\,\psi(x) - \sum x\,\psi(x-1) = \lambda - (\lambda + 1) = -1 .

Similarly, it is easily shown that

\sum \Delta^2\psi(x) = 0 , \qquad \sum x\,\Delta^2\psi(x) = 0 ,
\sum x^2\,\Delta\psi(x) = -2\lambda - 1 , \qquad \sum x^2\,\Delta^2\psi(x) = 2 .
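These approximate sums are easy to verify numerically. The sketch below is an illustration, not from the text; it takes \psi(x) = e^{-\lambda}\lambda^x/x! with the backward difference \Delta\psi(x) = \psi(x) - \psi(x-1), and sums over enough terms that the neglected tail is insignificant. For \lambda = 10 the four sums come out 0, -1, -(2\lambda + 1), and 2 to many decimal places.

```python
from math import exp, factorial

lam = 10.0
N = 100  # cut-off; Poisson(10) ordinates beyond x = 100 are utterly negligible

def psi(x):
    """Poisson exponential function; zero for negative x."""
    return 0.0 if x < 0 else exp(-lam) * lam**x / factorial(x)

def dpsi(x):
    """Backward difference  Delta psi(x) = psi(x) - psi(x-1)."""
    return psi(x) - psi(x - 1)

def d2psi(x):
    """Second backward difference of psi."""
    return psi(x) - 2 * psi(x - 1) + psi(x - 2)

s0 = sum(dpsi(x) for x in range(N))           # ~ 0
s1 = sum(x * dpsi(x) for x in range(N))       # ~ -1
s2 = sum(x * x * dpsi(x) for x in range(N))   # ~ -(2*lam + 1)
t2 = sum(x * x * d2psi(x) for x in range(N))  # ~ 2
print(round(s0, 8), round(s1, 8), round(s2, 8), round(t2, 8))
```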


Substituting these values in equations (33), we obtain

c_0 = 1 , \qquad \lambda c_0 - c_1 = \mu_1' , \qquad (\lambda + \lambda^2)c_0 - (2\lambda + 1)c_1 + 2c_2 = \mu_2' .

If we take \lambda = \mu_1', we have the coefficient c_1 = 0.

Then, expressing the second moment \mu_2' in terms of
the second moment \mu_2 about the mean by the relation

\mu_2' = \mu_2 + \lambda^2 ,

we have

\lambda + \lambda^2 + 2c_2 = \mu_2 + \lambda^2 , \qquad c_2 = \tfrac{1}{2}(\mu_2 - \lambda) .

Hence, we write

F(x) = \psi(x) + \tfrac{1}{2}(\mu_2 - \lambda)\,\Delta^2\psi(x) ,

when \lambda is taken equal to the first moment \mu_1', which is
the arithmetic mean of the values of the given variates.
It is fairly obvious that this application of moments
to finding values of the coefficients can be extended to
more terms should they be needed in dealing with actual data.
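The resulting recipe (take \lambda equal to the observed mean, then add the \tfrac{1}{2}(\mu_2 - \lambda)\Delta^2\psi correction) is simple to carry out. The sketch below is a modern illustration with invented function names, not from the text; it fits the first three terms of the Type B series to an observed relative-frequency distribution. Fed an essentially exact Poisson distribution, it returns c_2 near zero, so F(x) reproduces \psi(x).

```python
import numpy as np
from math import exp, factorial

def psi(x, lam):
    """Poisson exponential function; zero for negative x."""
    return 0.0 if x < 0 else exp(-lam) * lam**x / factorial(x)

def d2psi(x, lam):
    """Second backward difference of psi."""
    return psi(x, lam) - 2 * psi(x - 1, lam) + psi(x - 2, lam)

def type_b_fit(xs, fx):
    """F(x) = psi(x) + (1/2)(mu2 - lam) Delta^2 psi(x), lam = observed mean."""
    xs = np.asarray(xs, float)
    fx = np.asarray(fx, float)
    lam = float(np.sum(xs * fx))              # first moment mu_1'
    mu2 = float(np.sum((xs - lam)**2 * fx))   # second moment about the mean
    c2 = 0.5 * (mu2 - lam)
    return np.array([psi(int(v), lam) + c2 * d2psi(int(v), lam) for v in xs])

# Check against an (essentially) exact Poisson distribution with lam = 2.
xs = np.arange(0, 20)
fx = np.array([psi(int(v), 2.0) for v in xs])
fx /= fx.sum()                                # renormalize the truncated tail
F = type_b_fit(xs, fx)
print(bool(np.allclose(F, fx, atol=1e-6)))
```

For an empirical distribution whose variance exceeds or falls short of its mean, c_2 is nonzero and the correction term adjusts the Poisson ordinates accordingly.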


NOTES

1. Page 1. Émile Borel, Éléments de la théorie des probabilités, p. 167;
Le hasard, p. 154.

2. Page 23. Julian L. Coolidge, An introduction to mathematical 
probability (1925), pp. 13-32. 

3. Page 30. Tchebycheff, Des valeurs moyennes, Journal de Mathématiques
(2), Vol. 12 (1867), pp. 177-84.

4. Page 30. M. Bienaymé, Considérations à l'appui de la découverte
de Laplace sur la loi de probabilité dans la méthode des moindres carrés,
Comptes Rendus, Vol. 37 (1853), pp. 309-24.

5. Page 31. E. L. Dodd, The greatest and the least variate under 
general laws of error, Transactions of the American Mathematical Society, 
Vol. 25 (1923), pp. 525-39. 

6. Pages 31 and 47. Some writers call this theorem the "Bernoulli
theorem" and others call it the "Laplace theorem." It has been shown
recently by Karl Pearson that most of the credit for the theorem should
go to De Moivre rather than to Bernoulli. For this reason we call the
theorem the "De Moivre-Laplace theorem" rather than the "Bernoulli-
Laplace theorem." See Historical note on the origin of the normal curve of
errors, Biometrika, Vol. 16 (1924), p. 402; also James Bernoulli's theorem,
Biometrika, Vol. 17 (1925), p. 201.

7. Page 32. For a proof see Coolidge, An introduction to mathematical 
probability, pp. 38-42. 

8. Pages 37 and 68. James W. Glover, Tables of applied mathematics 
(1923), pp. 392-411. 

9. Page 37. Karl Pearson, Tables for statisticians and biometricians 
(1914), pp. 2-9. 

10. Page 39. Poisson, Recherches sur la probabilité des jugements,
Paris, 1837, pp. 205 ff.

11. Page 39. Bortkiewicz, Das Gesetz der kleinen Zahlen, Leipzig, 1898.

12. Page 48. For various proofs of the normal law, see David Brunt,
The combination of observations (1917), pp. 11-24; also Czuber, Beobachtungsfehler
(1891), pp. 48-110.

13. Page 50. Karl Pearson, Mathematical contributions to the theory
of evolution, Philosophical Transactions, A, Vol. 186 (1895), pp. 343-414.



14. Page 50. Karl Pearson, Supplement to a memoir on skew variation,
Philosophical Transactions, A, Vol. 197 (1901), pp. 443-56.

15. Page 50. Karl Pearson, Second supplement to a memoir on skew 
variation, Philosophical Transactions, A, Vol. 216 (1916), pp. 429-57. 

16. Page 60. J. P. Gram, Om Raekkeudviklinger (Doctor's
dissertation), Copenhagen, 1879; also Über die Entwickelung reeller Functionen
in Reihen mittelst der Methode der kleinsten Quadrate, Journal für
Mathematik, Vol. 94 (1883), pp. 41-73.

17. Page 60. T. N. Thiele, Almindelig Iagttagelseslaere, Copenhagen,
1889; cf. Thiele, Theory of observations, 1903.

18. Page 60. F. Y. Edgeworth, The asymmetrical probability-curve,
Philosophical Magazine, Vol. 41 (1896), pp. 90-99; also The law of error,
Cambridge Philosophical Transactions, Vol. 20 (1904), pp. 36-65, 113-41.

19. Page 60. G. T. Fechner, Kollektivmasslehre (ed. G. F. Lipps), 1897.

20. Page 60. H. Bruns, Über die Darstellung von Fehlergesetzen,
Astronomische Nachrichten, Vol. 143, No. 3429 (1897); also Wahrscheinlichkeitsrechnung
und Kollektivmasslehre, 1906.

21. Page 60. C. V. L. Charlier, Über das Fehlergesetz, Arkiv för
Matematik, Astronomi och Fysik, Vol. 2, No. 8 (1905), pp. 1-9; also
Über die Darstellung willkürlicher Funktionen, Arkiv för Matematik,
Astronomi och Fysik, Vol. 2, No. 20 (1905), pp. 1-35.

22. Page 60. V. Romanovsky, Generalization of some types of frequency
curves of Professor Pearson, Biometrika, Vol. 16 (1924), pp. 106-17.

23. Page 63. Wera Myller-Lebedeff, Die Theorie der Integralgleichungen
in Anwendung auf einige Reihenentwicklungen, Mathematische
Annalen, Vol. 64 (1907), pp. 388-416.

24. Pages 65 and 156. S. D. Wicksell, Contributions to the analytical
theory of sampling, Arkiv för Matematik, Astronomi och Fysik, Vol. 17,
No. 19 (1923), pp. 1-46.

25. Pages 67 and 169. In the use of the least-squares criterion that
V in (16), § 20, and in (30), § 58, shall be a minimum, a question naturally
arises as to the propriety of weighting squares of deviations with
the reciprocal 1/\phi(x) = (2\pi)^{1/2} e^{x^2/2} of the normal function. Gram used
this weighting without commenting on its propriety so far as the writer
has been able to learn. One fairly obvious point in support of the
weighting is its algebraic convenience.

26. Page 68. N. R. Jorgensen, Undersogelser over Frequensflader og 
Korrelation (1916), pp. 178-93. 

27. Page 74. H. L. Rietz, Frequency distributions obtained by certain
transformations of normally distributed variates, Annals of Mathematics,
Vol. 23 (1922), pp. 292-300.


28. Page 74. S. D. Wicksell, On the genetic theory of frequency, Arkiv
för Matematik, Astronomi och Fysik, Vol. 12, No. 20 (1917), pp. 1-56.

29. Page 75. E. L. Dodd, The frequency law of a function of variables 
with given frequency laws, Annals of Mathematics, Ser. 2, Vol. 27, No. 1 
(1925), pp. 12-20. 

30. Page 75. S. Bernstein, Sur les courbes de distribution des probabilités,
Mathematische Zeitschrift, Vol. 24 (1925), pp. 199-211.

31. Page 81. Francis Galton, Proceedings of the Royal Society, Vol. 
40 (1886), Appendix by J. D. Hamilton Dickson, p. 63. 

32. Page 81. Karl Pearson, Mathematical contributions to the theory 
of evolution III, Philosophical Transactions, A, Vol. 187 (1896), pp. 253- 

33. Page 81. G. Udny Yule, On the significance of Bravais' formulae
for regression, etc., Proceedings of the Royal Society, Vol. 60
(1897), pp. 477-89.

34. Page 84. E. V. Huntington, Mathematics and statistics, American 
Mathematical Monthly, Vol. 26 (1919), p. 424. 

35. Page 86. H. L. Rietz, On functional relations for which the coefficient
of correlation is zero, Quarterly Publications of the American
Statistical Association, Vol. 16 (1919), pp. 472-76.

36. Page 91. Karl Pearson, On a correction to be made to the correla- 
tion ratio, Biometrika, Vol. 8 (1911-12), pp. 254-56; see also Student, 
The correction to be made to the correlation ratio for grouping, Biometrika, 
Vol. 9 (1913), pp. 316-20. 

37. Page 92. Karl Pearson, On a general method of determining the
successive terms in a skew regression line, Biometrika, Vol. 13 (1920-21),
pp. 296-300.

38. Page 100. Maxime Bôcher, Introduction to higher algebra (1912),
p. 33.

39. Page 101. L. Isserlis, On the partial correlation ratio, Biometrika,
Vol. 10 (1914-15), pp. 391-411.

40. Page 101. Karl Pearson, On the partial correlation ratio, Proceedings
of the Royal Society, A, Vol. 91 (1914-15), pp. 492-98.

41. Page 102. H. L. Rietz, Urn schemata as a basis for the development
of correlation theory, Annals of Mathematics, Vol. 21 (1920), pp. 306-22.

42. Page 103. A. A. Tschuprow, Grundbegriffe und Grundprobleme
der Korrelationstheorie (1925).

43. Page 109. H. L. Rietz, On the theory of correlation with special
reference to certain significant loci on the plane of distribution, Annals of
Mathematics (Second Series), Vol. 13 (1912), pp. 195-96.

44. Pages 111 and 112. Karl Pearson, On the theory of contingency
and its relation to association and normal correlation, Drapers' Company
Research Memoirs (Biometric Series I), (1904), p. 10.

45. Page 111. E. Czuber, Theorie der Beobachtungsfehler (1891), pp. 

46. Page 111. James McMahon, Hyperspherical goniometry; and its
application to correlation theory for n variables, Biometrika, Vol. 15 (1923),
pp. 192-208; paper edited by F. W. Owens after the death of Professor
McMahon.

47. Page 112. Karl Pearson, On the correlation of characters not 
quantitatively measurable, Philosophical Transactions, A, Vol. 195 (1900), 
pp. 1-47. 

48. Page 112. Karl Pearson, On further methods of determining correlation,
Drapers' Company Research Memoirs (Biometric Series IV)
(1907), pp. 10-18.

49. Page 112. Warren M. Persons, Correlation of time series (Hand- 
book of Mathematical Statistics [1924], pp. 150-65). 

50. Page 112. C. Gini, Nuovi contributi alla teoria delle relazioni
statistiche, Atti del R. Istituto Veneto di S.L.A., Tome 74, P. II (1914-15).

51. Page 112. Louis Bachelier, Calcul des probabilités (1912), chaps.
17 and 18.

52. Page 113. Seimatsu Narumi, On the general forms of bivariate 
frequency distributions which are mathematically possible when regression 
and variation are subjected to limiting conditions, Biometrika, Vol. 15 
(1923), pp. 77-88, 209-21. 

53. Page 113. Karl Pearson, Notes on skew frequency surfaces, Bio- 
metrika, Vol. 15 (1923), pp. 222-44. 

54. Page 113. Burton H. Camp, Mutually consistent multiple re- 
gression surfaces, Biometrika, Vol. 17 (1925), pp. 443-58. 

55. Page 126. G. Bohlmann, Formulierung und Begründung zweier
Hilfssätze der mathematischen Statistik, Mathematische Annalen, Vol. 74
(1913), pp. 341-409.

56. Page 134. Burton H. Camp, Problems in sampling, Journal of
the American Statistical Association, Vol. 18 (1923), pp. 964-77.

57. Page 138. Student, The probable error of a mean, Biometrika, 
Vol. 6 (1908-9), pp. 1-25. 

58. Page 138. Karl Pearson, On the distribution of the standard deviations
of small samples: Appendix I to papers by "Student" and R. A.
Fisher, Biometrika, Vol. 10 (1914-15), pp. 522-29.

59. Page 139. R. A. Fisher, Frequency distribution of the values of the
correlation coefficient in samples from an indefinitely large population,
Biometrika, Vol. 10 (1914-15), pp. 507-21.


60. Page 139. H. E. Soper and others, On the distribution of the
correlation coefficient in small samples, Appendix II to the papers of "Student"
and R. A. Fisher, Biometrika, Vol. 11 (1915-17), pp. 328-413.

61. Page 142. Karl Pearson, On generalised Tchebycheff theorems in
the mathematical theory of statistics, Biometrika, Vol. 12 (1918-19), pp.

62. Page 143. M. Alf. Guldberg, Sur le théorème de M. Tchébychef,
Comptes Rendus, Vol. 175 (1922), p. 418; also Sur quelques inégalités dans
le calcul des probabilités, Vol. 175 (1922), p. 1382. M. Birger Meidell, Sur
un problème du calcul des probabilités et les statistiques mathématiques,
Comptes Rendus, Vol. 175 (1922), p. 806; also Sur la probabilité des erreurs,
Comptes Rendus, Vol. 176 (1923), p. 280. B. H. Camp, A new generalization
of Tchebycheff's statistical inequality, Bulletin of the American Mathematical
Society, Vol. 28 (1922), pp. 427-32. Seimatsu Narumi, On further
inequalities with possible applications to problems in the theory of probability,
Biometrika, Vol. 15 (1923), p. 245.

63. Page 147. W. Lexis, Über die Theorie der Stabilität statistischer
Reihen, Jahrbücher für Nationalökonomie und Statistik, Vol. 32 (1879),
pp. 60-98; Abhandlungen zur Theorie der Bevölkerungs- und Moralstatistik,
Kap. V-IX.

64. Page 147. C. V. L. Charlier, Vorlesungen über die Grundzüge der
mathematischen Statistik (1920), p. 5.

65. Page 147. J. M. Keynes, A treatise on probability (1921), p. 393. 

66. Page 153. In this connection, the expression "L ≈ 1" means
"L = 1 apart from chance fluctuations."

67. Page 153. Handbook of Mathematical Statistics (1924), pp. 88-91;
C. V. L. Charlier, Vorlesungen über die Grundzüge der mathematischen
Statistik (1920), pp. 38-42.

68. Page 154. Birth statistics for the registration area of the United
States (1921), p. 37.

INDEX

(Numbers refer to pages)

Arithmetic mean and mathematical expectation, 14-16

Bachelier, 112 

Bernoulli, 2; distribution, 23; theorem of, 27-31; series, 146
Bernstein, 75
Bertrand, 109
Bielfeld, J. F. von, 2
Bienaymé-Tchebycheff criterion, 28-29; generalization of, 140-44
Binomial distributions, 22-27, 51
Bôcher, 175
Bohlmann, 126
Borel, 1
Bortkiewicz, 39
Bravais, 3 
Bruns, 49, 60 
Brunt, 173 

Camp, 113, 134, 143, 144 

Carver, 75 

Cattell and Brimhall, 38 

Charlier, 2, 49, 60-67, 156-77 

Coefficient of alienation, 87 

Coolidge, 23, 173 

Correlation, 77-113; meaning of, 77-78; regression method, 78-103; correlation surface method, 79, 104-11; correlation coefficient, 82; linear regression, 84; non-linear regression, 88; correlation ratio, 88-91; multiple, 92-102; partial, 98-101; standard deviation of arrays — standard error of estimate, 87-90, 95; multiple correlation coefficient, 97; partial correlation coefficient, 98-99; multiple correlation ratio, 101; normal correlation surfaces, 104-11; of errors,
Czuber, 173 

De Moivre, 2, 3 

De Moivre-Laplace theory, 31-38, 43-45, 156
Deviation: quartile, 38; standard,
Dickson, J. Hamilton, 81
Discrepancy, 26; relative, 27
Dispersion: normal, 3, 153; subnormal, 3, 153; supernormal, 4, 153; measures of, 14, 27, 153
Dodd, 31, 75

Edgeworth, 2, 49, 60, 73 

Elderton, 53, 60 

Ellipse of maximum probability,
Error; see Probable error and Standard error
Euler, 3
Excess, 71-72

Fechner, 49, 60 
Fisher, R. A., 139 
Frequency, relative, 6 
Frequency curves: defined, 13; normal, 34, 47; generalized, 48-76
Frequency distribution, observed and theoretical, 12-14
Frequency functions: defined, 13; of one variable, 46-76; normal, 34, 47; generalized, 48-76




Galton, 81-82 

Gauss, 3, 47 

Generating function, 60, 75, 76,
Gini, 112
Glover, Tables of applied mathematics, 37, 68
Gram, 2, 49, 60
Gram-Charlier series, 60, 61, 65, 72, 75-76; development of, 156-77; coefficients of, 65-68, 165-70

Guldberg, 143 

Hermite polynomials, 66, 75-76,

Heteroscedastic system, 88 
Homoscedastic system, 88 
Huntington, 175 
Hypergeometric series, 52, 165 

Isserlis, 101 

Jacobi polynomials, 75 
Jorgensen, 68 

Laguerre polynomials, 76 
Laplace, 2, 3; see De Moivre-Laplace theory
Lexis, 2, 3; theory, 146-55; series, 146, 150; ratio, 152

Maclaurin, 3 

McMahon, 111 

Mathematical expectation, 14-16, 116-17; of the power of a variable, 18-21; of successes, 26
Median, 134; standard error in,
Meidell, 143-44
Mode and most probable value, 17-18, 25
Moments: defined, 18; about an arbitrary origin, 18; about the arithmetic mean, 19; applied to Pearson's system of frequency curves, 58; coefficients of Gram-Charlier series in terms of, 66, 168, 170
Most probable number of successes, 25
Multiple correlation coefficient, 97
Myller-Lebedeff, 174

Narumi, 113, 143 

Normal correlation surfaces, 104-11
Normal frequency curve, 34, 47, 50; generalized, 60, 61, 65-69,
Partial correlation coefficient, 98-99
Pearson, 2, 47, 49; generalized frequency curves, 50-60, 75, 81, 92, 101, 111-13, 138, 142, 144-45,
Pearson's system of frequency curves, 50-60, 75

Persons, 176 

Poisson, 3; exponential function, 39-45, 61, 164; series, 148

Population, 4 

Probability: meaning of, 6-11; a 
priori, 10; a posteriori, 10; statis- 
tical, 10 

Probable ellipse, 108

Probable error, 39, 127-30, 132 

Quartile deviation, 39 
Quetelet, 3 

Random sampling fluctuations, 

Regression: curve defined, 79; linear, 84, 93; non-linear, 88-91, 101; surface defined, 93; method of, 78-103

Relative frequency and probabil- 
ity, 6-11 

Rietz, 174, 175 

Romanovsky, 60, 75 



Scedastic curve, 87 

Sheppard, 37 

Simple sampling, 22-44, 114-36

Skewness, 68-71 

Small samples, 138-40 

Soper, Young, Cave, Lee, and 
Pearson, 139 

Standard deviation, 27, 82; of random sampling fluctuations, 116; of sum of independent variables, 136; see Standard error

Standard error: defined, 119; in class frequencies, 119-21; in arithmetic mean, 126-27; in kth moment, 130-32; in median, 134; in averages from small samples, 138-40

Stirling, 3, 32 

Student, 138 

Tchebycheff, 2, 28-30, 140-44 
Thiele, 49, 60 
Tschuprow, 102 

Whittaker, Lucy, 43 
Whittaker and Robinson, 64 
Wicksell, 65, 74, 164 

Yule, 81 
